Reverse Engineering a Source System - Metadata-Driven Code Generation (3 of 5)


This article is the 3rd of a 5-part series.


If you’ve made it this far, you’re in luck, as this is where the real fun begins.

Real quick, let’s just summarize the situation presented at the beginning of this post. You’re a Data Engineer newly placed on a Finance project at one of the Big 4 consulting firms, you’ve been handed the keys to a SQL Server instance of consolidations data, and asked to spend time looking at the data, see what you can do with it, and come back with any questions by the end of the week.

Now, clearly there are major questions to be asked about scope, timelines, resources, not even to mention what the actual end user requirements are. But, there’s plenty you can do to productively explore the system and get a sense of what it entails and a rough idea of what the scope would be to incorporate it into your data warehouse, as already demonstrated above.

You’ve now reverse engineered the ERD for the database, and you’ve spent time with the metadata, forming a pretty clear picture of the system at both a high level and a fairly granular level.

Now, you want to spend time familiarizing yourself with the data itself. Where do you start? An early career data analyst might just start by doing data previews on each of the tables, and that’s of course not a bad place to start. It doesn’t take long to glance through all of the columns of all the tables and start to build an intuitive sense for what the data represents, how clean it is, and how you might want to model it in your data warehouse.

However, there’s very likely millions (if not billions) of records in a lot of your tables, which is way too much for the human brain to grok just by scrolling through data previews. You also can’t really figure out how sparse the data is in different tables/columns, how many distinct values there are, what columns might serve as candidate keys, etc.

So, you decide to put together a handful of statistics queries to get a sense of the “profile” of the data.

In case this is news to anyone — this is an actual discipline in data engineering called “data profiling”, and I feel the need to spell this out having once worked with one of Snowflake’s founding engineers who was unfamiliar with the concept. (I can’t blame him though. His job was writing low-level distributed join algorithms themselves — not the daily data engineering grind that folks like you and I are up against.) But I’m digressing.

So, let’s put together a statistics “profile” of a single column from a single table from our TPC-DS database, based on what I called out above. More specifically, let’s look at:

  • the minimum value of the entire column
  • the maximum value of the entire column
  • what percentage of records are NULL for this column
  • the total number of distinct values of the column
  • whether or not this column might be a candidate key, which would play a role in how I model its corresponding entity in my data warehouse

(These are, of course, just some of the statistics you might wonder about for a given column. You could also look into things like average, median, standard deviation, p95, max length of text fields, min length of text fields, etc.)
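For a rough sense of what a few of those extra metrics could look like, here’s a hedged sketch (not part of the original walkthrough) against the same CALL_CENTER table profiled below, using Snowflake’s MEDIAN and PERCENTILE_CONT aggregates, plus a text column (CC_NAME) for the length checks:

select 
    avg(CC_SQ_FT) as avg_val, 
    median(CC_SQ_FT) as median_val, 
    stddev(CC_SQ_FT) as stddev_val, 
    percentile_cont(0.95) within group (order by CC_SQ_FT) as p95_val, 
    min(length(CC_NAME)) as min_name_len, 
    max(length(CC_NAME)) as max_name_len 
from 
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER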

I’m going to arbitrarily pick a table and column for this. Here’s the query:

select 
    cast(min(CC_SQ_FT) as string) as min, 
    cast(max(CC_SQ_FT) as string) as max, 
    round(sum(cast(CC_SQ_FT is null as int)) / count(*) * 100.0, 2) as sparsity_perc, 
    count(distinct CC_SQ_FT) as num_distinct_vals, 
    count(distinct CC_SQ_FT) / count(*) > .98 as candidate_key
from 
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER         

and here’s the result:

[Screenshot: result of the profiling query]

Now we’re getting somewhere.

By profiling the data, I can start to form insights into the data itself, which is going to help me have a much more productive conversation with that Finance leader at the end of the week.

For example, if I had just skimmed this particular table (CALL_CENTER) in a data preview, I might never have discovered that we have negative values within a “square feet” column, which makes no sense. So, I definitely want to capture this as a question to discuss with the Finance leader at the end of the week to understand what he wants me to do about this particular data quality issue.

I can also quickly see that this column is not sparse at all. There are no NULL values, so the quality of this data (in terms of its sparseness metric) is pretty high. Now, I should double check to see if the column is defined in the source system as NOT NULL. If it is (and as long as I also separately monitor source systems for schema changes), then I can safely ignore any required NULL data quality checks downstream for this particular column. Not to get bogged down in a data quality discussion (I’ll save that for another day), the point here is simply to demonstrate how a data profiling query can quickly bubble up helpful insights into a source system much more effectively than just your basic “data preview” queries.
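As a quick aside, that NOT NULL check is itself just another metadata query. Here’s a hedged sketch against the sample database; SQL Server exposes the same IS_NULLABLE column in its own INFORMATION_SCHEMA.COLUMNS, so the same idea applies to the source system itself:

select 
    table_name, 
    column_name, 
    is_nullable   -- 'NO' means the column is declared NOT NULL
from 
    snowflake_sample_data.information_schema.columns
where 
    true
    and table_schema = 'TPCDS_SF10TCL'
    and table_name = 'CALL_CENTER'
    and column_name = 'CC_SQ_FT'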

Also, checking if a “square foot” column is a potential candidate key is obviously nonsense. The point here, though, is that I want to put together reusable data profiling queries that I can apply to all columns in any given table, to find the needles in the haystack.

And this brings me to my next point.

If I’m going to be running the same kind of query against all of the different columns of all of the different tables in a system, I’d really rather avoid having to hand-code literally thousands of such queries. That’s obviously well out of scope for any data engineer in this position.

What I’d much rather do is leverage my SQL skills for code generation. I’d rather have the system auto-generate all of the queries I want to run.

I’ve been waiting this entire time as I write this blog post to get to this point. I get much too excited talking about automated data profiling / code generation. I’ve worked with many Data Engineers with decades-long careers who have never dug deep into how far you can get with metadata, so I really hope this opens your eyes and gives you a whole new set of tools to take to your data engineering project.

How excited I am talking about metadata-driven code generation (Note: I am very sad LinkedIn articles do not support animated GIFs)

Ok, time for a quick lunch break… then let’s get back to it.


Metadata-driven Code Generation (Part 1)

Let’s start the discussion with a very simple query, i.e. just taking a single column from that last data profiling query above.

select 
    count(distinct CC_SQ_FT) as num_distinct_vals
from 
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER        

Now, I want to essentially copy/paste this same query but for other tables/columns in scope. How can I go about doing so? (Don’t say Excel, don’t say Excel…)

Here’s a query that does just that. It hard-codes the parts of the query that we’ll re-use, and it then dynamically populates the parts of the query that depend on things like the names of the database/schema/table/column:

select 
  'select count(distinct ' || column_name || ') as num_distinct_vals' 
  || ' from ' || table_catalog || '.' || table_schema || '.' || table_name as query
from 
  columns  -- INFORMATION_SCHEMA.COLUMNS (assumes the session schema is set to an information schema)
where 
  true 
  and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
  and table_schema = 'TPCDS_SF10TCL'
  and table_name = 'CALL_CENTER'
order by 
  ordinal_position;        

Check out the result. I’ve just generated this same query for every column in the table CALL_CENTER! 🥳

Writing a query to write queries

So we’ve now got a powerful new tool in our Data Engineering toolbox: metadata-driven code generation (or as I like to call it: writing SQL to write SQL.)


Now, let’s make a couple more changes:

  1. There are obviously a few more data profiling metrics we wanted to calculate
  2. I’d rather not have to copy/paste/evaluate tens or hundreds or thousands of such queries, so let’s figure out how to “stack” these queries and their results so that we can run it all in one fell swoop.

Ok, so, first thing — bringing in those additional metrics with the same approach (by hard-coding parts of the data profiling query that we want to re-use, and dynamically passing in any schema/table/column names).

I’m also going to include the names of the table and the columns in the result so that I don’t lose the context of what I’m analyzing. I’m also going to grab the ORDINAL_POSITION column, which indicates the position in which a column is defined in a given table (more on this shortly), and lastly I’ll grab the COMMENT column as well for any additional insights into what a given column represents.

select 
    'select '
    || ' ''' || c.table_name || ''' as table_name' 
    || ', ''' || c.column_name || ''' as column_name' 
    || ', cast(min(' || c.column_name || ') as string) as min' 
    || ', cast(max(' || c.column_name || ') as string) as max' 
    || ', round(sum(cast(' || c.column_name || ' is null as int)) / count(*) * 100.0, 2) as sparsity_perc' 
    || ', count(distinct ' || c.column_name || ') as num_distinct_vals' 
    || ', count(distinct ' || c.column_name || ') / count(*) > .98 as candidate_key'
    || ', ' || c.ordinal_position || ' as ordinal_position'
    || ', ''' || coalesce(c.comment, '') || ''' as comments'
    || ' from ' 
    || 'snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
from 
    columns c
where 
  true 
  and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
  and table_schema = 'TPCDS_SF10TCL'
  and table_name = 'CALL_CENTER'        

which as we can see has generated all of the queries we were hoping for:

[Screenshot: the generated profiling queries, one per column]

Just to confirm, let’s pull out and format just one of these queries for a sanity check. Remember, this is just one such query of potential thousands that we automatically generated!

select  
  'CALL_CENTER' as table_name, 
  'CC_GMT_OFFSET' as column_name, 
  cast(min(CC_GMT_OFFSET) as string) as min, 
  cast(max(CC_GMT_OFFSET) as string) as max, 
  round(sum(cast((CC_GMT_OFFSET is null) as int)) / count(*) * 100.0, 2) as sparsity_perc, 
  count(distinct CC_GMT_OFFSET) as num_distinct_vals, 
  count(distinct CC_GMT_OFFSET) / count(*) > .98 as candidate_key, 
  30 as ordinal_position, 
  '' as comments 
from 
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER        

Very nice! We’ve got all of the additional metrics and metadata we wanted. Just a few more quick comments as well on this query:

  • As you can see, I’m casting the min() and max() values of the column as strings. This is required for the next step, where we’re going to “stack” each profiling query (and thus its result sets), which requires data types to match or else the query will fail.
  • The candidate_key boolean is somewhat arbitrary. I’m interested in understanding if the number of distinct values is close to the total number of records, which might indicate that it is a candidate key. How close? Up to you. I went with 98%, but you might want it to be 90%, 99.9%, etc. depending on your situation.
  • Speaking of that candidate key expression, it’s sometimes not obvious when such an expression results in a boolean value. A more explicit (but also redundant) expression would be:

case 
  when 
    count(distinct CC_GMT_OFFSET) / count(*) > .98
  then 
    true 
  else 
    false
end as candidate_key        

but we can dispense with the CASE statement altogether and just have the boolean expression evaluated directly (simpler code, probably better performance, slightly less obvious logic).

Now, remember, we also want to “stack” all of our profiling queries. Before modifying the code generation query, let’s just glance at an example of what we’re trying to achieve, now with two columns:

select  'CALL_CENTER' as table_name, 'CC_CALL_CENTER_SK' as column_name, cast(min(CC_CALL_CENTER_SK) as string) as min, cast(max(CC_CALL_CENTER_SK) as string) as max, round(sum(cast(CC_CALL_CENTER_SK is null as int)) / count(*) * 100.0, 2) as sparsity_perc, count(distinct CC_CALL_CENTER_SK) as num_distinct_vals, count(distinct CC_CALL_CENTER_SK) / count(*) > .98 as candidate_key, 1 as ordinal_position, '' as comments from snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER 

union all 

select  'CALL_CENTER' as table_name, 'CC_CALL_CENTER_ID' as column_name, cast(min(CC_CALL_CENTER_ID) as string) as min, cast(max(CC_CALL_CENTER_ID) as string) as max, round(sum(cast(CC_CALL_CENTER_ID is null as int)) / count(*) * 100.0, 2) as sparsity_perc, count(distinct CC_CALL_CENTER_ID) as num_distinct_vals, count(distinct CC_CALL_CENTER_ID) / count(*) > .98 as candidate_key, 2 as ordinal_position, '' as comments from snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER         

This is quite straightforward: it’s just two instances of our same query, but for two different columns in the same table, stacked with the UNION ALL set operator (technically UNION would’ve worked — I’ll argue for my preference for UNION ALL in this context some other day — but in short, better performance, and easier to debug).

Now, we already have a list of our profiling queries, so the question is — how do we combine them all together with this UNION ALL operator?

Go back to the introduction for this post and check out the example of using the LISTAGG() function for string aggregation. I told you it would eventually be needed. 😅 Keep in mind that the syntax can differ a fair bit in other systems (some call it STRING_AGG for example, some include the ORDER BY within the aggregation itself rather than this odd WITHIN GROUP syntax).
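To make that syntax difference concrete, here’s a hedged side-by-side on a throwaway two-row CTE. The first form is Snowflake’s (and Oracle’s) LISTAGG ... WITHIN GROUP; the commented-out form is PostgreSQL’s STRING_AGG, which takes the ORDER BY inside the aggregate:

with q as 
(
    select 'select 1' as query, 1 as ordinal_position
    union all
    select 'select 2', 2
)
select 
    listagg(query, ' union all ') within group (order by ordinal_position) as stacked
from 
    q;

-- PostgreSQL equivalent (for comparison):
-- select string_agg(query, ' union all ' order by ordinal_position) as stacked from q;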

Also, as mentioned, you’ll need to familiarize yourself with CTEs if you haven’t already.

Ok, here’s the fully baked metadata-driven code generation data profiling query (well, one such example):

with 
    profiling_queries as
    (
        select 
            'select '
            || ' ''' || c.table_name || ''' as table_name' 
            || ', ''' || c.column_name || ''' as column_name' 
            || ', cast(min(' || c.column_name || ') as string) as min' 
            || ', cast(max(' || c.column_name || ') as string) as max' 
            || ', round(sum(cast(' || c.column_name || ' is null as int)) / count(*) * 100.0, 2) as sparsity_perc' 
            || ', count(distinct ' || c.column_name || ') as num_distinct_vals' 
            || ', count(distinct ' || c.column_name || ') / count(*) > .98 as candidate_key'
            || ', ' || c.ordinal_position || ' as ordinal_position'
            || ', ''' || coalesce(c.comment, '') || ''' as comments'
            || ' from ' 
            || 'snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
            ,c.table_catalog
            ,c.table_schema
            ,c.table_name
            ,c.ordinal_position as ordinal_position
        from 
            columns c
        where 
            true 
            and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
            and table_schema = 'TPCDS_SF10TCL'
            and table_name = 'CALL_CENTER'
    )
select 
    listagg(query, ' union all ') within group (order by ordinal_position) as all_profiling_queries
from
    profiling_queries

Now, run that query, copy the result into another SQL editor tab, and you’ll see this mess of code, which at first glance might be a bit frightening!

[Screenshot: the generated query, all profiling queries stacked with UNION ALL]

Don’t let it frighten you. Instead, bask in the glory of your creation...

Ju-Jitsu. I’m going to learn SQL Ju-Jitsu.

…by running it! Your result should look something like this:

[Screenshot: the stacked profiling results, one row per column]

Now we’ve got something we can really work with!

Remember, we’ve got a follow-up meeting with the Finance leader at the end of the week, during which we can review any questions we’ve come up with. We want to be comprehensive but also targeted with our questions, and the profiling query above gives us a lot of insight from which we can put together meaningful questions, such as:

  • What does the “_SK” suffix mean in terms of naming conventions? And presumably the “CC_” prefix is a naming standard requiring an acronym of the table which the column belongs to?
  • Why is CC_CLOSED_DATE_SK entirely null? Does that imply no call centers have ever been closed?
  • CC_CALL_CENTER_ID sounds as though it’s a unique identifier, but it’s not unique across all records. Why?
  • CC_OPEN_DATE_SK appears to represent date values as Julian dates. Can you confirm? This will inform some data cleansing logic and needed type casting. (See the sketch just after this list.)
  • The max() value of CC_EMPLOYEES is larger than the total population of the USA. What is it actually representing? Is this just bad data?
  • CC_SQ_FT can have negative values. What is that supposed to represent?
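For that CC_OPEN_DATE_SK question, you can partially test the hypothesis yourself before the meeting. Here’s a hedged sketch that assumes the TPC-DS DATE_DIM table is the intended lookup; if the join resolves to sensible calendar dates, you’ve all but confirmed it:

select 
    cc.CC_CALL_CENTER_ID, 
    cc.CC_OPEN_DATE_SK, 
    d.D_DATE   -- resolves the surrogate key to an actual calendar date
from 
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER cc
    left join snowflake_sample_data.tpcds_sf10tcl.DATE_DIM d 
        on d.D_DATE_SK = cc.CC_OPEN_DATE_SK
limit 10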

So, in a matter of minutes, you’ve now got a fairly in-depth understanding of one of the tables in the source system, giving you an understanding of what kind of ETL logic you’ll need to start implementing, as well as questions you’ll need to review with your project sponsor, i.e. the Finance leader.

Very cool!

Is it time for a beer yet? It’s 5 o’ clock somewhere…


There’s something rather important that I really need to emphasize. And I do actually mean this. This is very important.


If you’ve been paying attention, that table we’ve been analyzing, CALL_CENTER, contains all of 54 records. This tutorial was set up around this table intentionally, as profiling it isn’t going to cause any noticeable credit burn.

However, we’ve just learned how to generate a rather massive query, and how to do so incredibly quickly and easily. If we’re not cautious, we could run a similar query against STORE_SALES, which clocks in at 28 billion records. And if we’re really not cautious, we could quickly find ourselves querying every column, of every row, of every table, unbounded, across billions and billions of records.


Don’t be that Data Engineer.

It’s your responsibility to be careful with how you implement the approach in this blog post, which by and large requires some kind of filtering logic to ensure you’re not scanning ridiculous amounts of data. Here are a few suggestions on how to do so, but you’ll need to make the right determination depending on your data and your use cases.

  1. Base all of your profiling logic against a CTE like this, to ensure your record counts are limited (based on whatever makes sense for you):

with cte_1000 as (select * from STORE_SALES limit 1000)        

2. Take a similar approach with Snowflake’s SAMPLE capability (highly suggest reading the docs in detail)

with cte_1000 as (select * from STORE_SALES sample (1000 rows))

3. Figure out filter criteria that make sense for your purposes

with cte_1000 as (select * from STORE_SALES where SS_SOLD_DATE_SK = 2452570)        
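To make that concrete, here’s a hedged sketch of what a single generated profiling query might look like once it’s pointed at a limiting CTE rather than the raw table, using STORE_SALES and its SS_QUANTITY column purely as an example. (Note that fixed-size row sampling can itself be slow on a table this large; a percentage-based SAMPLE may be cheaper.) Your generator would simply emit the CTE once per table and swap the raw table reference for the CTE name:

with cte_1000 as 
(
    select * from snowflake_sample_data.tpcds_sf10tcl.STORE_SALES sample (1000 rows)
)
select 
    'STORE_SALES' as table_name, 
    'SS_QUANTITY' as column_name, 
    cast(min(SS_QUANTITY) as string) as min, 
    cast(max(SS_QUANTITY) as string) as max, 
    round(sum(cast(SS_QUANTITY is null as int)) / count(*) * 100.0, 2) as sparsity_perc, 
    count(distinct SS_QUANTITY) as num_distinct_vals, 
    count(distinct SS_QUANTITY) / count(*) > .98 as candidate_key
from 
    cte_1000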

Keep in mind that whichever approach you take, filtering/sampling/limiting your data is going to introduce some kind of skew and/or limitations in the results of your data profiling queries, so it’s important to understand what that impact will be so that you can still meaningfully leverage the results.

In my case, I rather lazily added logic that only generates data profiling queries for tables with less than 10 million records. With that and a few other bits of logic, here’s the “final” query:

with 
  smaller_tables as 
  (
    -- include table metric to prioritize analysis 
    select 
      table_catalog
      ,table_schema
      ,table_name
      ,row_count
      ,bytes
      ,comment
    from 
      tables 
    where 
      true 
      and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
      and table_schema = 'TPCDS_SF10TCL'
      and table_type = 'BASE TABLE'
      and row_count <= 10000000
  )
  ,profiling_queries as
  (
    select 
      'select '
      || ' ''' || c.table_name || ''' as table_name' 
      || ', ''' || c.column_name || ''' as column_name' 
      || ', cast(min(' || c.column_name || ') as string) as min' 
      || ', cast(max(' || c.column_name || ') as string) as max' 
      || ', round(sum(cast(' || c.column_name || ' is null as int)) / count(*) * 100.0, 2) as sparsity_perc' 
      || ', count(distinct ' || c.column_name || ') as num_distinct_vals' 
      || ', count(distinct ' || c.column_name || ') / count(*) > .98 as candidate_key'
      || ', ' || c.ordinal_position || ' as ordinal_position'
      || ', ''' || coalesce(c.comment, '') || ''' as comments'
      || ' from ' 
      || 'snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
      ,c.table_catalog
      ,c.table_schema
      ,c.table_name
      ,c.ordinal_position as ordinal_position
      ,st.row_count
      ,st.bytes
      ,coalesce(st.comment, '') as comment
    from 
      columns c
    inner join
      smaller_tables st on 
        true 
        and c.table_catalog = st.table_catalog
        and c.table_schema = st.table_schema
        and c.table_name = st.table_name
  )
select 
  table_name,
  comment,
  row_count,
  bytes,
  count(*) as num_columns,
  listagg(query, ' union all ') within group (order by ordinal_position) as all_profiling_queries
from
  profiling_queries
group by 
  1, 2, 3, 4
order by 
  table_name        

Let’s briefly review the results, as there are a few more things to discuss:

[Screenshot: generator output, one row per table with comment, row_count, bytes, column count, and the stacked profiling query]

First of all, you can see how the query was slightly modified to now generate a data profiling query for each and every table in your schema. (Nothing would stop you from expanding even further to every schema in your particular database, but that’s probably too much for a single analysis.)

I’ve included metrics that might help prioritize which tables to look into first.

  • You could start with the low hanging fruit, i.e. the tables with the fewest columns
  • You’d probably want to save the tables with larger record counts for later, when you need to introduce additional filtering logic as referenced above for performance/credit reasons
  • Given the number of columns and rows, you’ll want to keep in mind raw data volume as well, captured by the BYTES column.

You’ll also see a handful of columns in the data profiling results that help you organize your analysis of the, uh, columns… that’s slightly confusing. Let me rephrase. Let’s look at the data profiling results again for a particular table:

[Screenshot: profiling results for a single table]

Now that you’ve got your profiling results, you can review them by sorting your results a few different ways:

  • Sorting by ORDINAL_POSITION is going to give you insight into the intuitive ordering of columns by whichever human designed each database table. Roughly speaking, earlier columns are usually “more important” than later columns in a table.
  • Sorting by COLUMN_NAME will give you a more intuitive sense of naming conventions and help your brain start to get used to what all the table includes.
  • Sorting by SPARSITY_PERC is going to give you a sense of which columns might be more/less meaningful in the context of analytics.
  • Sorting by NUM_DISTINCT_VALS is going to give you a sense of the “density” of each column, how well they compress, what order you might want to specify them in indexes (in non-Snowflake platforms), as well as an intuition for their semantic content. In other words, a column with only two distinct values is very likely a “flag” column. A column with the same number of values as the record count of the table is very likely a unique identifier (i.e. “candidate key”). So on and so forth.
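One practical note on re-sorting: in Snowflake you don’t have to re-run (and re-pay for) the big profiling query for each sort order. Here’s a hedged sketch using RESULT_SCAN, run immediately after the stacked profiling query finishes:

-- capture the query id of the stacked profiling query you just ran
set profiling_qid = last_query_id();

-- re-sort the cached result as many ways as you like, without re-scanning the source tables
select * from table(result_scan($profiling_qid)) order by ordinal_position;
select * from table(result_scan($profiling_qid)) order by sparsity_perc desc;
select * from table(result_scan($profiling_qid)) order by num_distinct_vals desc;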

So, not only do you have detailed “data profiles” for all of the tables/columns in your database, you also have a number of different ways to slice/dice and sort your results. Why is this important? Well, to be honest, there’s a fair amount of obscure/esoteric/tribal knowledge and skills in Senior-level Data Engineers, much of which just kind of develops naturally/implicitly. In this particular case, I’ve come to find that more advanced Data Engineers can come to an intuitive understanding of the data incredibly quickly (which I call “data fluency”), while more junior engineers will have to reference the ERD, the data dictionary, and/or run queries a lot more often to confirm something from the system. What’s really happening is that more advanced Data Engineers establish more connections in their brain from what the data represents. Again consider two approaches:

  • Data Preview on each table (like a Junior Data Analyst might do)
  • Statistical data profiling queries on each table (min, max, average, string lengths, sparsity, p95, distinct values, etc.), sorted in multiple different ways (like a Senior Data Engineer might do)

The second approach is going to cement much more information in your brain about the data, and you’ll quickly become “fluent” with the data, quickly memorizing the important tables/columns, a sense of what each represents, what the edge cases are, what the formatting and data quality issues are, etc. This is a skill that is difficult to teach in a deterministic fashion, but I think this blog post overall gives a decent sense of how to start acquiring this skill.

Is it time for another beer or should we bring espresso into the mix? Espresso martini perhaps? ☕️🍸


Let me point out a few ways you might want to improve this data profiling query. By no means is this list comprehensive. I’m simply trying to show you a few more places to find more fish along with some better bait to use.

  1. Remove unnecessary logic to simplify your findings and improve performance and credit burn. For example, the number of distinct values of timestamps is rarely meaningful, so you could exclude this particular analysis by updating that portion of code as follows:

|| case when c.data_type like 'TIMESTAMP%' then '' else ', count(distinct ' || c.column_name || ') as num_distinct_vals' end
-- note: '' (empty string) rather than NULL, since concatenating NULL would null out the entire generated query string;
-- if you still stack results with UNION ALL, emit ', null as num_distinct_vals' instead so the column lists stay aligned

In fact, you’ll likely want to do the same thing for decimal data that you might find in a FLOAT or NUMERIC field, and while you’re at it — you’ll likely want to exclude any BLOB/CLOB fields, VARIANT fields, ARRAY fields… plus more. You get the gist. You’re a Data Engineer. You can figure it out. 😉

2. Add more conditional logic specific to various data types (a sketch follows this list). For example:

  • STRING/TEXT/VARCHAR — it can often be useful to analyze the min/max lengths of string fields.
  • TIMESTAMP — sometimes it can be helpful to check the min/max of timestamps at coarser levels of granularity, i.e. at the DATE level. Other times it can help to convert all timestamps to UTC, for example, to ensure you’re comparing apples to apples during your analysis.
  • ARRAY — similar to strings (which themselves are just arrays of characters), I often want to understand the smallest, largest, and average size of array columns.

3. Add more complex logic for your use case.

  • JSON — you may well need to pre-parse and analyze JSON data which, obviously, is more prone to data quality issues than highly structured data.
  • STRING/TEXT/VARCHAR — You can obviously get incredibly creative with various string functions and regular expressions to parse your data for certain formats (phone numbers, SSNs, etc.) as well as looking for things like GUIDs.
  • GEOGRAPHY/GEOMETRY — I got nothing. I so rarely work with this kind of data. Share something insightful with me to update this blog with, and I’ll give you a shout out on LinkedIn!
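Circling back to item 2 above, here’s a hedged sketch of what that kind of data-type-aware generation might look like, emitting string-length metrics for text columns and date-level min/max for timestamp columns only. (Note: Snowflake’s INFORMATION_SCHEMA reports VARCHAR columns with a DATA_TYPE of 'TEXT', which is why that’s the value being checked.)

select 
    'select '
    || ' ''' || c.column_name || ''' as column_name'
    || case 
         when c.data_type = 'TEXT' 
           then ', min(length(' || c.column_name || ')) as min_len'
             || ', max(length(' || c.column_name || ')) as max_len'
         else ''
       end
    || case 
         when c.data_type like 'TIMESTAMP%' 
           then ', min(cast(' || c.column_name || ' as date)) as min_date'
             || ', max(cast(' || c.column_name || ' as date)) as max_date'
         else ''
       end
    || ' from snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
from 
    columns c
where 
    true 
    and c.table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
    and c.table_schema = 'TPCDS_SF10TCL'
    and c.table_name = 'CALL_CENTER'
    and (c.data_type = 'TEXT' or c.data_type like 'TIMESTAMP%')
order by 
    c.ordinal_position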

Otherwise though, you’ve now got everything you need to put this very powerful approach of metadata-driven code generation for data profiling to use, helping you quickly reverse engineer a source system and increase your data fluency as you build out analytics solutions for your customer!


Metadata-driven Code Generation (Part 2)

But wait, there’s more!

Well, kind of.

One more query for you to take with you on your data profiling journey. This time, no more fine print — just the query itself, followed by a few comments.

select 
  table_name
  ,'select ' 
  || column_name
  || ', count(*) as record_count'
  || ' from ' || table_catalog || '.' || table_schema || '.' || table_name || ' group by 1 order by 2 desc;'
from 
  columns 
where 
  true 
  and table_catalog = 'SNOWFLAKE_SAMPLE_DATA' 
  and table_schema = 'TPCDS_SF10TCL'
order by 
  table_name,
  ordinal_position;        

Again, let’s just look at one of the resulting generated queries.
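For the shape of it, here’s a hedged example of one such generated query, using CC_CLASS from CALL_CENTER purely as an illustrative column (the query in the screenshot below appears to have targeted a date-valued column):

select 
    CC_CLASS, 
    count(*) as record_count
from 
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
group by 1 
order by 2 desc;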

The UI here is DBeaver on my Macbook, which I’ll explain shortly (i.e. why I’m not using the Snowflake UI).

[Screenshot: a generated frequency distribution query and its results in DBeaver]

This is what’s referred to as a “frequency distribution” query, and as you can tell, it just tells you how many records you’ll find for each distinct value of a particular column.

The main point of this kind of query really is just to get that much more familiar with the data, get a sanity check on its quality, and better inform your expectations of what the data represents as you prepare to engage with stakeholders or sponsors to flesh out requirements and understand the system better (i.e. as you prepare to meet with the Finance leader at the end of the week in the contrived example above).

So, in this case, you can pretty easily see that there’s a gap in the data, i.e. there’s no value of 1999-01-01, which I would expect looking at the other values. So this is certainly something I’d want to ask the project sponsor about.

Secondly, you can also see that there are a lot more records in 1998 than any other year. That might be entirely expected, but it’s another example of a good/meaningful question that’s going to demonstrate to the sponsor that you’re doing your homework and not wasting his/her time with trivial questions that you could answer yourself.

Lastly, these are insights that you wouldn’t glean from either a simple “Data Preview” or even the statistical profiling queries generated above in this post. Hence, I would definitely recommend including frequency distribution queries in your data profiling toolbox.

Real quick — you’ll note that in this case, I didn’t group all of the queries together in one big massive query daisy-chained with UNION ALL, like I did previously. Why?

Well, for one, I really want to investigate one column at a time, so I’d have to copy the results of such a massive query into Excel and then just filter one column at a time (or write a similar SQL filter). This doesn’t save me any time compared to just flipping through result tabs in any modern SQL UI, so there’s no real value in daisy-chaining the queries together.

Secondly, it’s possible that some of these queries can produce a pretty massive result set, i.e. for those columns with high cardinality (i.e. high number of distinct values). So, your analysis for other columns can just get buried when the number of results exceeds your UI’s result set limit (usually 1000 or so). So, again, no value in daisy-chaining the results.

Unfortunately, Snowflake’s new Snowsight UI doesn’t support displaying multiple result sets after running multiple queries (although it does let you run them, which makes no sense for read-only queries). Hence, I downloaded DBeaver to accomplish this. There are a million other UIs you could use, I just picked one randomly.

But as you can see below, after running my data profiling query, I was able to copy the results, i.e. all of the subsequent frequency distribution queries, into DBeaver and run all of them, with multiple result sets available in the bottom of the screen (in the case of DBeaver, you just have to run “Execute SQL Script” rather than “Execute SQL Query”).

[Screenshot: multiple result sets displayed in DBeaver after running the script]

Now, all the same caveats as before, including:

  • Be mindful of how expensive such queries will be on your larger tables and include filter/sample/limit logic as applicable.
  • Frequency distribution queries offer little value for high cardinality columns (again, things like timestamps, decimals, JSON/VARIANT/CLOBS, etc.), so you should probably filter those out.
  • As before, remember that your brain will better comprehend the data if you explore it a few different ways. So, consider modifying your generation query to produce your profile queries in different orders (i.e. alphanumerically, then by ORDINAL_POSITION, then by sparsity, then by number of distinct values). You really will become much more “fluent” with the data, much more quickly, as your brain processes the various patterns that you’ll see.
  • Also as before, consider modifying the generation query so that your profile queries sort by the column itself, rather than the count of records. Again, just another way to look at the data for your brain to grasp the different patterns and become that much more fluent with the data.

Time for a night cap…


Final Thoughts

Personas

This tutorial really is based on a real-world project as described. The project sponsor, i.e. the one funding the project, was also the one with the tribal knowledge of the SQL Server instance in scope that I went back to with questions.

However, that’s probably the only project I’ve been on where the sponsor and the technical champion were the same person. As such, don’t read too much into my cavalier use of terms in this article. In real-life consulting and sales engineering engagements, it’s obviously critical to understand who your economic buyer, sponsor, champion, stakeholders, end users, etc. are. Fleshing out all the different roles, responsibilities, org charts, and RACI charts of a data engineering project is well outside the scope of this article — so just take that aspect of this post with a grain of salt.

Data Technology

It should go without saying that the analysis described in this post is going to be much more efficient in Snowflake (or any column store database) than SQL Server (or any row store database). With ELT being the more common pattern of the day in most cloud data engineering projects, I’ll state the obvious that you should plan on conducting most of your data profiling in your target system, after extracting raw data from your source systems. This would also be the case with your data lake architecture for any inefficient storage formats, such as CSV or JSON files. However, if you’re querying columnar formats from your lake, such as Parquet or Iceberg, you’re obviously going to get a lot more mileage.

One more comment, to slightly contradict myself. If you have an on-premises source system and data volumes that aren’t absurdly huge, and especially if your source system is some kind of analytical database, i.e. something with column storage, decent hardware, etc. — then, you might be in a position where it makes sense to batch up a lot of expensive data profiling queries overnight. You’ll obviously spend way less on the electricity running your box under someone’s desk than you would burning through the equivalent credits in your cloud system. This case is much more the exception than the rule, but still worth keeping in mind.

Code Generation

One of the most clever Data Architects I ever hired brought a fully-fledged ETL framework to the table when he came on board. He had a very similar approach: leveraging metadata to auto-generate a significant amount of his ETL code, along with a very robust error-handling and logging framework. Some day I might ask him if it’s something he’s willing to share, but what I’ve tried to emphasize is that it’s often much more about teaching a man/woman to fish than giving a man/woman a fish. Thus, I would encourage you to take the skills learned in this post and think through how you can extend them for other data engineering needs such as data observability and ETL.

COTS Software

When I first introduced the term “data profiling” (or really any of the other technical terms in this post and others I write), if you hadn’t previously heard of it, did you look it up? Here’s the Wikipedia article. One of my pet peeves on LinkedIn is so-called “influencers” publishing well-understood patterns/solutions from data engineering as if they had invented them. So, I obviously did not invent data profiling, but it is true that it’s rare to find this level of analysis among the skills of most Data Engineers. So, I do truly hope you found it valuable.

I also want to call out that data profiling is typically in the purview of the discipline of data quality (these days called “data observability”). Legacy tools like Informatica and BusinessObjects Information Steward, as well as more modern data observability solutions like Monte Carlo and Anomalo, all come with out-of-the-box data profiling capabilities, as it’s critical to profile source systems to understand what kind of data cleansing logic you need to apply before loading the consumption layer of your data warehouse.

So, if these capabilities are available “out of the box”, why bother with all this hand-coded SQL?

Well, the most important reason, I would say, is that there are many data analytics projects that simply don’t have the budget for a fully-fledged end-to-end enterprise data warehouse with all the bells and whistles, including data quality / data observability tools. So, it’s that much more important to share these kinds of skills with the market so that Data Engineers can still be highly effective under limited budgets.

Secondly, my experience in life really has been that you’re more effective if you can show all your work (as opposed to relying on calculators or software), not less. It’s just like in high school. The kid who could estimate sine and cosine values of any given angle (I can’t, and I have a math degree) without a scientific calculator had a much better grasp of the concepts, learned new concepts much quicker, and made far fewer mistakes. So, too, with data engineering. The more you know how to do “the hard way”, the better you’ll be even when executing your work “the easy way”.

Lastly — personal preference. Many folks in IT do as much as they can via command line rather than GUI-driven workflows. I find myself constantly tweaking my queries, and I’m often much faster and more efficient doing so just in a SQL console rather than fiddling with the various dashboarding features in a data profiling tool.


As you can tell, this is my first blog post in a very long time detailing rather technical content. As such, please forgive any rough edges, points of confusion, etc. I really enjoyed writing this article, and I truly would value any/all feedback! Do you see a way to improve this post? Whether it’s removing confusion, adding clarity, adding additional concepts, or if you simply want to share whether you enjoyed (or hated it), let me know! A DM on LinkedIn or an email to jody@heschconsulting.com would be fantastic. Thanks so much!
