Reverse Engineering a Source System - Data Model (1 of 5)
Here’s the first of five or six articles I’m putting together that walk through a systematic approach for quickly reverse engineering a database / data warehouse, whether on-prem or cloud, and regardless of vendor. The goal is to significantly accelerate, and improve, how you come to understand the data, metadata, and relationships of entities in a system, in order to help drive data modeling, ETL, and data observability initiatives within the larger context of building and maintaining a large-scale data platform.
Content
Content discussed in this series includes:
A couple FYIs:
First up, a bit of context around when/how I originally established this approach.
Context
I had just finished two large projects over the past year at one of the Big 4 consulting firms for their Federal practice: six months each, building data marts for Time and Expense reporting. As you can imagine, there’s quite a bit of compliance and bureaucracy that consulting firms must deal with when offering services to the government, so time/expense reporting takes on much more importance than in typical consulting engagements.
Anywho. Once I finished those projects, I was asked to join a financial transformation project, and more specifically, received the following request from one of the Finance leaders, an all too familiar one for any Data Engineer who’s spent any length of time in the field:
Finance Leader: “Hey Jody, we’re wanting to bring in this consolidations data from this SQL Server instance. Can you have a look, see what you think we can do, and come back to me by the end of the week with any questions?”
Me: “Sure thing. Do you have an ERD, data dictionary, naming standards, or any other documentation I can peruse while I start digging through the data?”
Finance Leader: “…I don’t think so, but I’ll double check. Meanwhile, just have a look at the data, see what you can come up with.”
This was somewhat earlier in my career, and I still had some residual naiveté from my college years: a basic expectation that I would receive documentation of source systems, and requirements for target systems, on a silver platter. Everything spelled out, no ambiguity, nothing missing; all the information I needed to start data modeling, designing ETL logic, and cranking out some code in short order.
Obviously my expectations were disconnected from reality, so I had to figure out how to quickly “reverse engineer” this source system: figure out what it represented, what naming standards seemed to be in place, what the data model was, where the complexities lay, where data quality issues might be hiding, and start teasing out what kind of ETL logic I’d need to implement for whatever use cases we settled on.
The following tutorial walks through the approach I took, which continues to benefit me to this day. What it basically does is give you a quick but rather comprehensive handle on the data, metadata, and entity relationships of the system you’re exploring. But before digging into these, let me quickly tackle the prerequisite knowledge required for this tutorial.
Prerequisites
This tutorial, and most that I plan on publishing, are intended for early/mid stage Data Engineers (although hopefully Senior folks will get plenty of value out of them as well). Thus, this tutorial assumes basic familiarity with SQL and its commonly used constructs and functions, basics of relational databases (ERD diagrams, primary/foreign keys, etc.), technical comfort spinning up free trials of common platforms, etc.
The one function introduced later in this tutorial that folks perhaps haven’t used is string aggregation (which goes by different names in different databases). It basically takes string values from a given column across multiple rows and concatenates them into a single result column in a single row, typically in an order you define. Here is an example in Snowflake, demonstrating a grocery list ordered by department.
(Also, if you’re unfamiliar with common table expressions, also called CTEs, then take a moment to check them out via Google/ChatGPT — quite straightforward and helpful constructs for encapsulating and reusing SQL sub-queries.)
with
fake_data as
(
select 'apple' as grocery_item, 'produce' as department union all
select 'beer' as grocery_item, 'liquor' as department union all
select 'carrots' as grocery_item, 'produce' as department union all
select 'yogurt' as grocery_item, 'dairy' as department union all
select 'wine' as grocery_item, 'liquor' as department union all
select 'cheddar' as grocery_item, 'dairy' as department
)
-- 1) aggregate all of the grocery items
-- 2) use a comma and a space as the delimiter
-- 3) order by department
select
listagg(grocery_item, ', ') within group (order by department) as groceries
from
fake_data;
which results in this output, as expected:
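If you run it yourself, you should get back a single row with a single GROCERIES column containing something like the value below (the departments come out in alphabetical order, though the ordering of items within the same department isn’t guaranteed):
yogurt, cheddar, beer, wine, apple, carrots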
Cool. We’ll circle back to this later, but first, let’s start the tutorial with one of the more important tools in the toolbox of any Data Engineer worth his/her salt: the data model (i.e., the metadata of your entities and how they are related to one another).
1) Reverse Engineering the Data Model
(The first thing any fellow old data dinosaurs probably thought of when reading the title of this article was, “Oh great, he’s just going to walk through how to ‘Reverse Engineer’ from some data modeling tool like PowerDesigner”… and you’d be right, at least for this section. So, if you’re already familiar with that capability in one such tool, then you’re basically familiar with any other, and you can skip this section entirely.)
My observation is that a lot of modern-day use cases in data engineering really don’t depend on robust data models nearly as much as more traditional enterprise use cases do. More specifically, subject areas like AdTech, MarTech, gaming data, telemetry data, genomic data, etc. often consist of one (or a few) massive event tables plus a handful of relatively simple contextual tables.
Nonetheless, for the times where you’re integrating enterprise data sources like ERP, CRM, etc. — it’s important to quickly understand things like cardinality, join conditions, primary/foreign/candidate keys, subject areas, etc. — all of which become much easier to parse out of an ERD than just a list of this information.
So, I decided to pick a more modern data modeling tool (SqlDBM) and use that in coordination with the most ubiquitous cloud data platform of the day, Snowflake.
Setting Up Your Environment
I’d like to make this (and other) tutorial(s) as simple as possible, and I think that goal is easily accomplished here. There are really only two things you need to do: spin up a free Snowflake trial (if you don’t already have an instance you can play in), and sign up for a free SqlDBM trial.
Creating Your Data Model
Fortunately, Snowflake already comes with a database that represents a real-life transactional system, which we’re going to use for this tutorial: the TPCDS_SF10TCL schema within the SNOWFLAKE_SAMPLE_DATA database that comes pre-populated in any Snowflake instance, based on the TPC-DS benchmark.
Now, in order to visualize the underlying data model for this database, we need to manually create the primary and foreign keys ourselves, as it’s these relationships that drive the visual links between tables in an ERD.
(In case you’re wondering, standard Snowflake tables don’t enforce referential integrity, which is often far less important for analytical use cases than for transactional ones, so there’s no real value for Snowflake in pre-baking foreign keys into this sample data. It’s a similar story with primary keys: they don’t actually enforce uniqueness, sadly, so for the purposes of a benchmark data model they, uh, serve no real purpose. Hence we need to add them ourselves.)
Moreover, we can’t modify these tables directly (since they’re part of a Snowflake share), so what we can easily do is recreate them in our own database/schema and then add the primary/foreign keys.
(Yes, this is a slightly convoluted approach to demonstrate a fairly simple concept. Appreciate your patience!)
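(One more aside: if you want to convince yourself that these constraints really are informational only, here’s a minimal sketch, using a hypothetical throwaway table in whatever scratch schema you have handy, that shows duplicate keys sailing straight through.)
-- hypothetical throwaway table; the duplicate insert succeeds because
-- primary keys on standard Snowflake tables are declared but not enforced
CREATE OR REPLACE TABLE pk_demo (id INTEGER PRIMARY KEY);
INSERT INTO pk_demo VALUES (1);
INSERT INTO pk_demo VALUES (1); -- no error raised
SELECT COUNT(*) FROM pk_demo;   -- returns 2
DROP TABLE pk_demo;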
After fighting a bit with ChatGPT, I was able to get it to generate everything we need. We’ll create a database, a schema, all of the corresponding tables, and then add in primary/foreign keys.
Go ahead and copy/paste/run the SQL script below in your Snowflake environment.
(PLEASE NOTE: The queries below do NOT actually copy any data, just the structures of the tables, and for those familiar with zero-copy clones, they are not supported from Snowflake shares — hence not used here.)
-- create my sandbox
CREATE DATABASE IF NOT EXISTS my_db;
CREATE SCHEMA IF NOT EXISTS my_db.my_schema;
-- create my tables without copying data
CREATE TABLE IF NOT EXISTS my_db.my_schema.call_center LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.call_center;
CREATE TABLE IF NOT EXISTS my_db.my_schema.catalog_page LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.catalog_page;
CREATE TABLE IF NOT EXISTS my_db.my_schema.customer LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.customer;
CREATE TABLE IF NOT EXISTS my_db.my_schema.customer_address LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.customer_address;
CREATE TABLE IF NOT EXISTS my_db.my_schema.customer_demographics LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.customer_demographics;
CREATE TABLE IF NOT EXISTS my_db.my_schema.date_dim LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.date_dim;
CREATE TABLE IF NOT EXISTS my_db.my_schema.household_demographics LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.household_demographics;
CREATE TABLE IF NOT EXISTS my_db.my_schema.income_band LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.income_band;
CREATE TABLE IF NOT EXISTS my_db.my_schema.inventory LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.inventory;
CREATE TABLE IF NOT EXISTS my_db.my_schema.item LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.item;
CREATE TABLE IF NOT EXISTS my_db.my_schema.promotion LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.promotion;
CREATE TABLE IF NOT EXISTS my_db.my_schema.reason LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.reason;
CREATE TABLE IF NOT EXISTS my_db.my_schema.ship_mode LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.ship_mode;
CREATE TABLE IF NOT EXISTS my_db.my_schema.store LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.store;
CREATE TABLE IF NOT EXISTS my_db.my_schema.time_dim LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.time_dim;
CREATE TABLE IF NOT EXISTS my_db.my_schema.warehouse LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.warehouse;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_page LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_page;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_site LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_site;
CREATE TABLE IF NOT EXISTS my_db.my_schema.catalog_sales LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.catalog_sales;
CREATE TABLE IF NOT EXISTS my_db.my_schema.catalog_returns LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.catalog_returns;
CREATE TABLE IF NOT EXISTS my_db.my_schema.store_sales LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.store_sales;
CREATE TABLE IF NOT EXISTS my_db.my_schema.store_returns LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.store_returns;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_sales LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_sales;
CREATE TABLE IF NOT EXISTS my_db.my_schema.web_returns LIKE SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.web_returns;
-- add primary keys
ALTER TABLE my_db.my_schema.call_center ADD CONSTRAINT pk_call_center PRIMARY KEY (cc_call_center_sk);
ALTER TABLE my_db.my_schema.catalog_page ADD CONSTRAINT pk_catalog_page PRIMARY KEY (cp_catalog_page_sk);
ALTER TABLE my_db.my_schema.customer ADD CONSTRAINT pk_customer PRIMARY KEY (c_customer_sk);
ALTER TABLE my_db.my_schema.customer_address ADD CONSTRAINT pk_customer_address PRIMARY KEY (ca_address_sk);
ALTER TABLE my_db.my_schema.customer_demographics ADD CONSTRAINT pk_customer_demographics PRIMARY KEY (cd_demo_sk);
ALTER TABLE my_db.my_schema.date_dim ADD CONSTRAINT pk_date_dim PRIMARY KEY (d_date_sk);
ALTER TABLE my_db.my_schema.household_demographics ADD CONSTRAINT pk_household_demographics PRIMARY KEY (hd_demo_sk);
ALTER TABLE my_db.my_schema.income_band ADD CONSTRAINT pk_income_band PRIMARY KEY (ib_income_band_sk);
ALTER TABLE my_db.my_schema.inventory ADD CONSTRAINT pk_inventory PRIMARY KEY (inv_item_sk, inv_warehouse_sk, inv_date_sk);
ALTER TABLE my_db.my_schema.item ADD CONSTRAINT pk_item PRIMARY KEY (i_item_sk);
ALTER TABLE my_db.my_schema.promotion ADD CONSTRAINT pk_promotion PRIMARY KEY (p_promo_sk);
ALTER TABLE my_db.my_schema.reason ADD CONSTRAINT pk_reason PRIMARY KEY (r_reason_sk);
ALTER TABLE my_db.my_schema.ship_mode ADD CONSTRAINT pk_ship_mode PRIMARY KEY (sm_ship_mode_sk);
ALTER TABLE my_db.my_schema.store ADD CONSTRAINT pk_store PRIMARY KEY (s_store_sk);
ALTER TABLE my_db.my_schema.time_dim ADD CONSTRAINT pk_time_dim PRIMARY KEY (t_time_sk);
ALTER TABLE my_db.my_schema.warehouse ADD CONSTRAINT pk_warehouse PRIMARY KEY (w_warehouse_sk);
ALTER TABLE my_db.my_schema.web_page ADD CONSTRAINT pk_web_page PRIMARY KEY (wp_web_page_sk);
ALTER TABLE my_db.my_schema.web_site ADD CONSTRAINT pk_web_site PRIMARY KEY (web_site_sk);
-- add foreign keys
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_customer_sk FOREIGN KEY (ss_customer_sk) REFERENCES my_db.my_schema.customer(c_customer_sk);
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_item_sk FOREIGN KEY (ss_item_sk) REFERENCES my_db.my_schema.item(i_item_sk);
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_sold_date_sk FOREIGN KEY (ss_sold_date_sk) REFERENCES my_db.my_schema.date_dim(d_date_sk);
ALTER TABLE my_db.my_schema.store_sales ADD CONSTRAINT fk_store_sales_ss_store_sk FOREIGN KEY (ss_store_sk) REFERENCES my_db.my_schema.store(s_store_sk);
ALTER TABLE my_db.my_schema.web_sales ADD CONSTRAINT fk_web_sales_ws_bill_customer_sk FOREIGN KEY (ws_bill_customer_sk) REFERENCES my_db.my_schema.customer(c_customer_sk);
ALTER TABLE my_db.my_schema.web_sales ADD CONSTRAINT fk_web_sales_ws_item_sk FOREIGN KEY (ws_item_sk) REFERENCES my_db.my_schema.item(i_item_sk);
ALTER TABLE my_db.my_schema.web_sales ADD CONSTRAINT fk_web_sales_ws_sold_date_sk FOREIGN KEY (ws_sold_date_sk) REFERENCES my_db.my_schema.date_dim(d_date_sk);
ALTER TABLE my_db.my_schema.catalog_sales ADD CONSTRAINT fk_catalog_sales_cs_bill_customer_sk FOREIGN KEY (cs_bill_customer_sk) REFERENCES my_db.my_schema.customer(c_customer_sk);
ALTER TABLE my_db.my_schema.catalog_sales ADD CONSTRAINT fk_catalog_sales_cs_item_sk FOREIGN KEY (cs_item_sk) REFERENCES my_db.my_schema.item(i_item_sk);
ALTER TABLE my_db.my_schema.catalog_sales ADD CONSTRAINT fk_catalog_sales_cs_sold_date_sk FOREIGN KEY (cs_sold_date_sk) REFERENCES my_db.my_schema.date_dim(d_date_sk);
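(Optionally, before pointing SqlDBM at the schema, you can sanity-check that the constraints landed where you expect; a couple of SHOW commands along these lines should list them out.)
-- confirm the primary and foreign keys were created
SHOW PRIMARY KEYS IN SCHEMA my_db.my_schema;
SHOW IMPORTED KEYS IN SCHEMA my_db.my_schema;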
Connecting SqlDBM
Once you’ve created a trial account in SqlDBM, click the big blue “Create Project” button, then select “Snowflake” (make sure you select “Existing database”).
Then, select the “Connect to DW” button.
Finding the “Server” value was slightly unintuitive for me, so here’s a leg up.
Log in to your Snowflake UI and navigate to “Copy account URL” at the bottom left of the screen, as follows (keep in mind that I’m using the newer Snowsight UI; if you’re in the legacy Snowflake editor, you’ll probably need to do a quick Google to figure out where this lives):
Then paste that URL into the “Server” of the SqlDBM form and enter your credentials. (I left “Role” blank without any problems.)
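(If you’d rather not hunt through the UI at all, you can also approximate that hostname with a bit of SQL. The sketch below assumes the newer orgname-accountname URL format; older locator-based URLs look different, so treat it as a convenience rather than gospel.)
-- approximate the account hostname from SQL (assumes orgname-accountname URLs)
SELECT LOWER(CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME())
       || '.snowflakecomputing.com' AS server_host;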
Then select “MY_DB” and “MY_SCHEMA” and continue on.
Keep the next modal window as-is, and click “Apply”.
Now, you should see a screen with the SQL table definitions listed out, and at the top of the screen you should see an upward-pointing arrow labeled “Upload SQL Script”.
(Side rant for the SqlDBM UX folks, if they ever come across this article. I literally just connected to my DW. Why is that button so prominent again at the top? And why are you asking me to run that GET_DDL() function at the top after you literally just parsed it out of Snowflake without problems? End rant.)
Then you’ll have to once more click the blue “Import” button, and then you should be greeted, rather confusingly, just with a black screen.
Next you’ll need to select the “Diagram Explorer” and manually add all of your tables:
Once you’ve got all of your tables added, you can then modify the auto-layout, zoom out a bit, and then you’re off to the races!
Your very own reverse-engineered entity-relationship diagram (ERD). Isn’t she cute??
Ok, in all seriousness (well, some seriousness), that was a fair bit of work just to pop out an ERD, but hopefully worth it. ERDs really are a critical tool for Data Engineers to be able to use and understand, even if you’re not working with much enterprise data. You can’t get away from having to understand join conditions, join cardinality, and how to organize tables into subject areas, especially when (as in my case) you’re handed the keys to some completely unexpected source system that you need to quickly understand and reverse engineer.
(As is obvious, this blog post is not meant as a comprehensive tutorial of either data modeling as a discipline or SqlDBM as a tool, so when it comes to things like adding subject areas, exploring data types, and really everything else that such data modeling tools have to offer — well, as the old expression goes — those are exercises left to the reader.)
And even in cases where you’re working primarily with semi-structured data, you’re likely to be parsing at least some JSON (or, heaven forbid, XML) into a tabular structure, and you may well end up doing the same with unstructured data; either way, you just can’t get around data modeling.
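(For instance, here’s a minimal sketch, with a made-up JSON payload and purely illustrative field names, of flattening semi-structured data into rows and columns in Snowflake.)
-- flatten a made-up JSON payload into rows and columns
WITH raw_events AS (
    SELECT PARSE_JSON('{"order_id": 1001, "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]}') AS payload
)
SELECT
    payload:order_id::NUMBER AS order_id,
    item.value:sku::STRING   AS sku,
    item.value:qty::NUMBER   AS qty
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) AS item;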
Ok, enough about data modeling, for now.