Reverse Engineering a Source System - Data Model (1 of 5)
Here’s the first of five or six articles I’m putting together that walk through a systematic approach for quickly reverse engineering a database / data warehouse, whether on-prem or cloud, and regardless of vendor. The goal is to significantly accelerate, and improve, how you come to understand the data, metadata, and relationships of entities in a system, in order to help drive data modeling, ETL, and data observability initiatives within the larger context of building and maintaining a large-scale data platform.
Content
Content discussed in this series includes:
A couple FYIs:
First up, a bit of context around when/how I originally established this approach.
Context
I had just finished two large projects over the past year at one of the Big 4 consulting firms for their Federal practice: six months each, building data marts for Time and Expense reporting. As you can imagine, there’s quite a bit of compliance and bureaucracy that consulting firms must deal with when offering services to the government, so time/expense reporting takes on much more importance than in typical consulting engagements.
Anywho. Once I finished those projects, I was asked to join a financial transformation project, and more specifically, received the following request from one of the Finance leaders, an all too familiar one for any Data Engineer who’s spent any length of time in the field:
Finance Leader: “Hey Jody, we’re wanting to bring in this consolidations data from this SQL Server instance. Can you have a look, see what you think we can do, and come back to me by the end of the week with any questions?”
Me: “Sure thing. Do you have an ERD, data dictionary, naming standards, or any other documentation I can peruse while I start digging through the data?”
Finance Leader: “…I don’t think so, but I’ll double check. Meanwhile, just have a look at the data, see what you can come up with.”
This was somewhat earlier in my career, and I still had some residual naiveté from my college years: a basic expectation that I would receive documentation of source systems, and requirements for target systems, on a silver platter. Everything spelled out, no ambiguity, nothing missing; all the information I needed to start data modeling, designing ETL logic, and cranking out some code in short order.
Obviously my expectations were disconnected from reality, so I had to figure out how to quickly “reverse engineer” this source system: figure out what it represented, what naming standards seemed to be in place, what the data model was, where the complexities lay, where data quality issues might be hiding, and start teasing out what kind of ETL logic I’d need to implement for whatever use cases we settled on.
The following tutorial walks through the approach I took, which continues to benefit me to this day. What it basically does is give you a quick but rather comprehensive handle on the data, metadata, and entity relationships of the system you’re exploring. But before digging into these, let me quickly tackle the prerequisite knowledge required for this tutorial.
Prerequisites
This tutorial, and most that I plan on publishing, are intended for early/mid stage Data Engineers (although hopefully Senior folks will get plenty of value out of them as well). Thus, this tutorial assumes basic familiarity with SQL and its commonly used constructs and functions, basics of relational databases (ERD diagrams, primary/foreign keys, etc.), technical comfort spinning up free trials of common platforms, etc.
The one function introduced later in this tutorial that folks perhaps haven’t used is string aggregation (which goes by different names in different databases). It basically takes string values from a given column across multiple rows and concatenates them into a single result column in a single row, typically in an order you define. Here is an example in Snowflake, demonstrating a grocery list ordered by department.
(Also, if you’re unfamiliar with common table expressions, also called CTEs, then take a moment to check them out via Google/ChatGPT — quite straightforward and helpful constructs for encapsulating and reusing SQL sub-queries.)
with
fake_data as
(
select 'apple' as grocery_item, 'produce' as department union all
select 'beer' as grocery_item, 'liquor' as department union all
select 'carrots' as grocery_item, 'produce' as department union all
select 'yogurt' as grocery_item, 'dairy' as department union all
select 'wine' as grocery_item, 'liquor' as department union all
select 'cheddar' as grocery_item, 'dairy' as department
)
-- 1) aggregate all of the grocery items
-- 2) use a comma and a space as the delimiter
-- 3) order by department
select
listagg(grocery_item, ', ') within group (order by department) as groceries
from
fake_data;
which results in this output, as expected:
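If you run it yourself, you should get back a single row with a single GROCERIES column containing something like the value below (the departments come out in alphabetical order, though the ordering of items within the same department isn’t guaranteed):
yogurt, cheddar, beer, wine, apple, carrots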
Cool. We’ll circle back to this later, but first, let’s start the tutorial with one of the more important tools in the toolbox of any Data Engineer worth his/her salt: the data model (i.e., the metadata of your entities and how they are related to one another).
1) Reverse Engineering the Data Model
(The first thing any fellow old data dinosaurs probably thought of when reading the title of this article was, “Oh great, he’s just going to walk through how to ‘Reverse Engineer’ from some data modeling tool like PowerDesigner”… and you’d be right, at least for this section. So, if you’re already familiar with that capability in one such tool, then you’re basically familiar with any other, and you can skip this section entirely.)
My observation is that a lot of modern-day use cases in data engineering really don’t depend on robust data models nearly as much as more traditional enterprise use cases do. More specifically, subject areas like AdTech, MarTech, gaming data, telemetry data, genomic data, etc. often consist of one (or a few) massive event tables plus a handful of relatively simple contextual tables.
Nonetheless, for the times where you’re integrating enterprise data sources like ERP, CRM, etc. — it’s important to quickly understand things like cardinality, join conditions, primary/foreign/candidate keys, subject areas, etc. — all of which become much easier to parse out of an ERD than just a list of this information.
So, I decided to pick a more modern data modeling tool (SqlDBM) and use that in coordination with the most ubiquitous cloud data platform of the day, Snowflake.
Setting Up Your Environment
I’d like to make this (and other) tutorial(s) as simple as possible, and I think that goal is easily accomplished here. There are really only two things you need to do: spin up a free Snowflake trial (if you don’t already have an instance you can play in), and sign up for a free SqlDBM trial.
Creating Your Data Model
Fortunately, Snowflake already comes with a database that represents a real-life transactional system, which we’re going to use for this tutorial: the TPCDS_SF10TCL schema within the SNOWFLAKE_SAMPLE_DATA database that comes pre-populated in any Snowflake instance, based on the TPC-DS benchmark.
Now, in order to visualize the underlying data model for this database, we need to manually create the primary and foreign keys ourselves, as it’s these relationships that drive the visual links between tables in an ERD.
(In case you’re wondering, standard Snowflake tables don’t enforce referential integrity, which is often far less important for analytical use cases than for transactional ones, so there’s no real value for Snowflake in pre-baking foreign keys into this sample data. It’s a similar story with primary keys: they don’t actually enforce uniqueness, sadly, so for the purposes of a benchmark data model they, uh, serve no real purpose. Hence we need to add them ourselves.)
Moreover, we can’t modify these tables directly (since they’re part of a Snowflake share), so what we can easily do is recreate them in our own database/schema and then add the primary/foreign keys.
(Yes, this is a slightly convoluted approach to demonstrate a fairly simple concept. Appreciate your patience!)
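(One more aside: if you want to convince yourself that these constraints really are informational only, here’s a minimal sketch, using a hypothetical throwaway table in whatever scratch schema you have handy, that shows duplicate keys sailing straight through.)
-- hypothetical throwaway table; the duplicate insert succeeds because
-- primary keys on standard Snowflake tables are declared but not enforced
CREATE OR REPLACE TABLE pk_demo (id INTEGER PRIMARY KEY);
INSERT INTO pk_demo VALUES (1);
INSERT INTO pk_demo VALUES (1); -- no error raised
SELECT COUNT(*) FROM pk_demo;   -- returns 2
DROP TABLE pk_demo;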
After fighting a bit with ChatGPT, I was able to get it to generate everything we need. We’ll create a database, a schema, all of the corresponding tables, and then add in primary/foreign keys.
Go ahead and copy/paste/run the SQL script below in your Snowflake environment.
(PLEASE NOTE: The queries below do NOT actually copy any data, just the structures of the tables, and for those familiar with zero-copy clones, they are not supported from Snowflake shares — hence not used here.)
-- create my sandbox
CREATE DATABASE IF NOT EXISTS my_db;
CREATE SCHEMA IF NOT EXISTS my_db.my_schema;
-- create my tables without copying data
CREATE TABLE IF NOT EXISTS my_db.my_schema.call_center LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.call_center;
CREATE TABLE IF NOT EXISTS my_db.my_schema.catalog_page LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.catalog_page;
CREATE TABLE IF NOT EXISTS my_db.my_schema.customer LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.customer;
CREATE TABLE IF NOT EXISTS my_db.my_schema.customer_address LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.customer_address;
CREATE TABLE IF NOT EXISTS my_db.my_schema.customer_demographics LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.customer_demographics;
CREATE TABLE IF NOT EXISTS my_db.my_schema.date_dim LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.date_dim;
CREATE TABLE IF NOT EXISTS my_db.my_schema.household_demographics LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.household_demographics;
CREATE TABLE IF NOT EXISTS my_db.my_schema.income_band LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.income_band;
CREATE TABLE IF NOT EXISTS my_db.my_schema.inventory LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.inventory;
CREATE TABLE IF NOT EXISTS my_db.my_schema.item LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.item;
CREATE TABLE IF NOT EXISTS my_db.my_schema.promotion LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.promotion;
CREATE TABLE IF NOT EXISTS my_db.my_schema.reason LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.reason;
CREATE TABLE IF NOT EXISTS my_db.my_schema.ship_mode LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.ship_mode;
CREATE TABLE IF NOT EXISTS my_db.my_schema.store LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.store;
CREATE TABLE IF NOT EXISTS my_db.my_schema.time_dim LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.time_dim;
CREATE TABLE IF NOT EXISTS my_db.my_schema.warehouse LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.warehouse;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_page LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_page;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_site LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_site;
CREATE TABLE IF NOT EXISTS my_db.my_schema.catalog_sales LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.catalog_sales;
CREATE TABLE IF NOT EXISTS my_db.my_schema.catalog_returns LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.catalog_returns;
CREATE TABLE IF NOT EXISTS my_db.my_schema.store_sales LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.store_sales;
CREATE TABLE IF NOT EXISTS my_db.my_schema.store_returns LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.store_returns;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_sales LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_sales;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_returns LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_returns;
-- add primary keys
ALTER TABLE my_db.my_schema.call_center ADD CONSTRAINT pk_call_center PRIMARY KEY (cc_call_center_sk);
ALTER TABLE my_db.my_schema.catalog_page ADD CONSTRAINT pk_catalog_page PRIMARY KEY (cp_catalog_page_sk);
ALTER TABLE my_db.my_schema.customer ADD CONSTRAINT pk_customer PRIMARY KEY (c_customer_sk);
ALTER TABLE my_db.my_schema.customer_address ADD CONSTRAINT pk_customer_address PRIMARY KEY (ca_address_sk);
ALTER TABLE my_db.my_schema.customer_demographics ADD CONSTRAINT pk_customer_demographics PRIMARY KEY (cd_demo_sk);
ALTER TABLE my_db.my_schema.date_dim ADD CONSTRAINT pk_date_dim PRIMARY KEY (d_date_sk);
ALTER TABLE my_db.my_schema.household_demographics ADD CONSTRAINT pk_household_demographics PRIMARY KEY (hd_demo_sk);
ALTER TABLE my_db.my_schema.income_band ADD CONSTRAINT pk_income_band PRIMARY KEY (ib_income_band_sk);
ALTER TABLE my_db.my_schema.inventory ADD CONSTRAINT pk_inventory PRIMARY KEY (inv_item_sk, inv_warehouse_sk, inv_date_sk);
ALTER TABLE my_db.my_schema.item ADD CONSTRAINT pk_item PRIMARY KEY (i_item_sk);
ALTER TABLE my_db.my_schema.promotion ADD CONSTRAINT pk_promotion PRIMARY KEY (p_promo_sk);
ALTER TABLE my_db.my_schema.reason ADD CONSTRAINT pk_reason PRIMARY KEY (r_reason_sk);
ALTER TABLE my_db.my_schema.ship_mode ADD CONSTRAINT pk_ship_mode PRIMARY KEY (sm_ship_mode_sk);
ALTER TABLE my_db.my_schema.store ADD CONSTRAINT pk_store PRIMARY KEY (s_store_sk);
ALTER TABLE my_db.my_schema.time_dim ADD CONSTRAINT pk_time_dim PRIMARY KEY (t_time_sk);
ALTER TABLE my_db.my_schema.warehouse ADD CONSTRAINT pk_warehouse PRIMARY KEY (w_warehouse_sk);
ALTER TABLE my_db.my_schema.web_page ADD CONSTRAINT pk_web_page PRIMARY KEY (wp_web_page_sk);
ALTER TABLE my_db.my_schema.web_site ADD CONSTRAINT pk_web_site PRIMARY KEY (web_site_sk);
-- add foreign keys
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_customer_sk FOREIGN KEY (ss_customer_sk) REFERENCES my_db.my_schema.customer(c_customer_sk);
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_item_sk FOREIGN KEY (ss_item_sk) REFERENCES my_db.my_schema.item(i_item_sk);
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_sold_date_sk FOREIGN KEY (ss_sold_date_sk) REFERENCES my_db.my_schema.date_dim(d_date_sk);
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_store_sk FOREIGN KEY (ss_store_sk) REFERENCES my_db.my_schema.store(s_store_sk);
ALTER TABLE my_db.my_schema.web_sales ADD CONSTRAINT fk_web_sales_ws_bill_customer_sk FOREIGN KEY (ws_bill_customer_sk) REFERENCES my_db.my_schema.customer(c_customer_sk);
ALTER TABLE my_db.my_schema.web_sales ADD CONSTRAINT fk_web_sales_ws_item_sk FOREIGN KEY (ws_item_sk) REFERENCES my_db.my_schema.item(i_item_sk);
ALTER TABLE my_db.my_schema.web_sales ADD CONSTRAINT fk_web_sales_ws_sold_date_sk FOREIGN KEY (ws_sold_date_sk) REFERENCES my_db.my_schema.date_dim(d_date_sk);
ALTER TABLE my_db.my_schema.catalog_sales ADD CONSTRAINT fk_catalog_sales_cs_bill_customer_sk FOREIGN KEY (cs_bill_customer_sk) REFERENCES my_db.my_schema.customer(c_customer_sk);
ALTER TABLE my_db.my_schema.catalog_sales ADD CONSTRAINT fk_catalog_sales_cs_item_sk FOREIGN KEY (cs_item_sk) REFERENCES my_db.my_schema.item(i_item_sk);
ALTER TABLE my_db.my_schema.catalog_sales ADD CONSTRAINT fk_catalog_sales_cs_sold_date_sk FOREIGN KEY (cs_sold_date_sk) REFERENCES my_db.my_schema.date_dim(d_date_sk);
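(Optionally, before pointing SqlDBM at the schema, you can sanity-check that the constraints landed where you expect; a couple of SHOW commands along these lines should list them out.)
-- confirm the primary and foreign keys were created
SHOW PRIMARY KEYS IN SCHEMA my_db.my_schema;
SHOW IMPORTED KEYS IN SCHEMA my_db.my_schema;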
Connecting SqlDBM
Once you’ve created a trial account in SqlDBM, click the big blue “Create Project” button, then select “Snowflake” (make sure you select “Existing database”).
Then, select the “Connect to DW” button.
Finding the “Server” value was slightly unintuitive for me, so here’s a leg up.
Log in to your Snowflake UI and navigate to “Copy account URL” at the bottom left of the screen, as follows (keep in mind that I’m using the newer Snowsight UI; if you’re in the legacy Snowflake editor, you’ll probably need to do a quick Google to figure out where this lives):
Then paste that URL into the “Server” of the SqlDBM form and enter your credentials. (I left “Role” blank without any problems.)
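(If you’d rather not hunt through the UI at all, you can also approximate that hostname with a bit of SQL. The sketch below assumes the newer orgname-accountname URL format; older locator-based URLs look different, so treat it as a convenience rather than gospel.)
-- approximate the account hostname from SQL (assumes orgname-accountname URLs)
SELECT LOWER(CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME())
       || '.snowflakecomputing.com' AS server_host;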
Then select “MY_DB” and “MY_SCHEMA” and continue on.
Keep the next modal window as-is, and click “Apply”.
Now, you should see a screen with the SQL table definitions listed out, and at the top of the screen you should see an upward-pointing arrow labeled “Upload SQL Script”.
(Side rant for the SqlDBM UX folks, if they ever come across this article. I literally just connected to my DW. Why is that button so prominent again at the top? And why are you asking me to run that GET_DDL() function at the top after you literally just parsed it out of Snowflake without problems? End rant.)
Then you’ll have to once more click the blue “Import” button, and then you should be greeted, rather confusingly, just with a black screen.
Next you’ll need to select the “Diagram Explorer” and manually add all of your tables:
Once you’ve got all of your tables added, you can then modify the auto-layout, zoom out a bit, and then you’re off to the races!
Your very own reverse-engineered entity-relationship diagram (ERD). Isn’t she cute??
Ok, in all seriousness (well, some seriousness), that was a fair bit of work just to pop out an ERD, but hopefully worth it. ERDs really are a critical tool for Data Engineers to be able to use and understand, even if you’re not working with much enterprise data. You can’t get away from having to understand join conditions, join cardinality, and how to organize tables into subject areas, especially when (as in my case) you’re handed the keys to some completely unexpected source system that you need to quickly understand and reverse engineer.
(As is obvious, this blog post is not meant as a comprehensive tutorial of either data modeling as a discipline or SqlDBM as a tool, so when it comes to things like adding subject areas, exploring data types, and really everything else that such data modeling tools have to offer — well, as the old expression goes — those are exercises left to the reader.)
And even in cases where you’re working primarily with semi-structured data, you’re likely to be parsing at least some JSON (or, heaven forbid, XML) into a tabular structure, and you may well end up doing the same with unstructured data; either way, you just can’t get around data modeling.
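(For instance, here’s a minimal sketch, with a made-up JSON payload and purely illustrative field names, of flattening semi-structured data into rows and columns in Snowflake.)
-- flatten a made-up JSON payload into rows and columns
WITH raw_events AS (
    SELECT PARSE_JSON('{"order_id": 1001, "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]}') AS payload
)
SELECT
    payload:order_id::NUMBER AS order_id,
    item.value:sku::STRING   AS sku,
    item.value:qty::NUMBER   AS qty
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) AS item;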
Ok, enough about data modeling, for now.