Reverse Engineering a Source System - Metadata-Driven Code Generation (3 of 5)
This article is the 3rd of a 5-part series that includes:
If you’ve made it this far, you’re in luck, as this is where the real fun begins.
Real quick, let’s summarize the situation presented at the beginning of this series. You’re a Data Engineer newly placed on a Finance project at one of the Big 4 consulting firms. You’ve been handed the keys to a SQL Server instance of consolidations data and asked to spend time looking at the data, see what you can do with it, and come back with any questions by the end of the week.
Now, clearly there are major questions to be asked about scope, timelines, resources, not even to mention what the actual end user requirements are. But, there’s plenty you can do to productively explore the system and get a sense of what it entails and a rough idea of what the scope would be to incorporate it into your data warehouse, as already demonstrated above.
You’ve now reverse engineered the ERD for the database, and you’ve spent time with the metadata, forming a pretty clear picture of the system at both a high level and a fairly granular level.
Now, you want to spend time familiarizing yourself with the data itself. Where do you start? An early career data analyst might just start by doing data previews on each of the tables, and that’s of course not a bad place to start. It doesn’t take long to glance through all of the columns of all the tables and start to build an intuitive sense for what the data represents, how clean it is, and how you might want to model it in your data warehouse.
However, there’s very likely millions (if not billions) of records in a lot of your tables, which is way too much for the human brain to grok just by scrolling through data previews. You also can’t really figure out how sparse the data is in different tables/columns, how many distinct values there are, what columns might serve as candidate keys, etc.
So, you decide to put together a handful of statistics queries to get a sense of the “profile” of the data.
In case this is news to anyone — this is an actual discipline in data engineering called “data profiling”, and I feel the need to spell this out having once worked with one of Snowflake’s founding engineers who was unfamiliar with the concept. (I can’t blame him though. His job was writing low-level distributed join algorithms themselves — not the daily data engineering grind that folks like you and I are up against.) But I’m digressing.
So, let’s put together a statistics “profile” of a single column from a single table from our TPC-DS database, based on what I called out above. More specifically, let’s look at:
(These are, of course, just some of the statistics you might wonder about for a given column. You could also look into things like average, median, standard deviation, p95, max length of text fields, min length of text fields, etc.)
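For reference, here’s a hedged sketch (Snowflake syntax) of what a few of those extra statistics could look like, using a numeric column and a text column from the TPC-DS CALL_CENTER table; swap in whatever columns you’re actually profiling:

```sql
select
     avg(CC_SQ_FT)                                           as avg_val
    ,median(CC_SQ_FT)                                        as median_val
    ,stddev(CC_SQ_FT)                                        as stddev_val
    ,percentile_cont(0.95) within group (order by CC_SQ_FT)  as p95_val
    -- for text columns, length statistics are often more useful:
    ,min(length(CC_CALL_CENTER_ID))                          as min_len
    ,max(length(CC_CALL_CENTER_ID))                          as max_len
from
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
```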
I’m going to arbitrarily pick a table and column for this. Here’s the query:
select
cast(min(CC_SQ_FT) as string) as min,
cast(max(CC_SQ_FT) as string) as max,
round(sum(cast(CC_SQ_FT is null as int)) / count(*) * 100.0, 2) as sparsity_perc,
count(distinct CC_SQ_FT) as num_distinct_vals,
count(distinct CC_SQ_FT) / count(*) > .98 as candidate_key
from
snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
and here’s the result:
Now we’re getting somewhere.
By profiling the data, I can start to form insights into the data itself, which is going to help me have a much more productive conversation with that Finance leader at the end of the week.
For example, if I had just skimmed through this particular table (CALL_CENTER) through a data preview, I might never have discovered that we have negative values within a “square feet” column, which makes no sense. So, I definitely want to capture this as a question to discuss with the Finance leader at the end of the week to understand what he wants me to do about this particular data quality issue.
I also can quickly see that this column is not sparse at all. There are no NULL values, so the quality of this data (in terms of its sparseness metric) is pretty high. Now, I should double check to see if the column is defined in the source system as NOT NULL. If it is (and as long as I also separately monitor source systems for schema changes), then I can safely ignore any required NULL data quality checks downstream for this particular column. Not to get bogged down in a data quality discussion (I’ll save that for another day), the point here is simply to demonstrate how a data profiling query can quickly bubble up helpful insights into a source system much more effectively than your basic “data preview” queries.
Also, checking if a “square foot” column is a potential candidate key is obviously nonsense. The point here, though, is that I want to put together reusable data profiling queries that I can apply to all columns in any given table, to find the needles in the haystack.
And this brings me to my next point.
If I’m going to be running the same kind of query against all of the different columns of all of the different tables in a system, I’d really rather avoid having to hand-code literally thousands of such queries. That’s obviously well out of scope for any data engineer in this position.
What I’d much rather do is leverage my SQL skills for code generation. I’d rather have the system auto-generate all of the queries I want to run.
I’ve been waiting this entire blog post to get to this point. I get much too excited talking about automated data profiling and code generation. I’ve worked with many Data Engineers with decades-long careers who have never dug deep into how far you can get with metadata, so I really hope this opens your eyes and gives you a whole new set of tools to take to your data engineering projects.
Ok, time for a quick lunch break… then let’s get back to it.
Metadata-driven Code Generation (Part 1)
Let’s start the discussion with a very simple query, i.e. just taking a single column from that last data profiling query above.
select
count(distinct CC_SQ_FT) as num_distinct_vals
from
snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
Now, I want to essentially copy/paste this same query but for other tables/columns in scope. How can I go about doing so? (Don’t say Excel, don’t say Excel…)
Here’s a query that does just that. It hard-codes the parts of the query that we’ll re-use, and it then dynamically populates the parts of the query that depend on things like the names of the database/schema/table/column:
select
'select count(distinct ' || column_name || ') as num_distinct_vals'
|| ' from ' || table_catalog || '.' || table_schema || '.' || table_name as query
from
snowflake_sample_data.information_schema.columns
where
true
and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
and table_schema = 'TPCDS_SF10TCL'
and table_name = 'CALL_CENTER'
order by
ordinal_position;
Check out the result. I’ve just generated this same query for every column in the table CALL_CENTER! 🥳
So we’ve now got a powerful new tool in our Data Engineering toolbox: metadata-driven code generation (or as I like to call it: writing SQL to write SQL.)
Now, let’s make a couple more changes:
Ok, so, first thing — bringing in those additional metrics with the same approach (by hard-coding parts of the data profiling query that we want to re-use, and dynamically passing in any schema/table/column names).
I’m also going to include the names of the table and the columns in the result so that I don’t lose the context of what I’m analyzing. I’ll also grab the ORDINAL_POSITION column, which indicates the position in which a column is defined in a given table (more on this shortly), and lastly I’ll grab the COMMENT column as well for any additional insights into what a given column represents.
select
'select '
|| ' ''' || c.table_name || ''' as table_name'
|| ', ''' || c.column_name || ''' as column_name'
|| ', cast(min(' || c.column_name || ') as string) as min'
|| ', cast(max(' || c.column_name || ') as string) as max'
|| ', round(sum(cast(' || c.column_name || ' is null as int)) / count(*) * 100.0, 2) as sparsity_perc'
|| ', count(distinct ' || c.column_name || ') as num_distinct_vals'
|| ', count(distinct ' || c.column_name || ') / count(*) > .98 as candidate_key'
|| ', ' || c.ordinal_position || ' as ordinal_position'
|| ', ''' || coalesce(c.comment, '') || ''' as comments'
|| ' from '
|| 'snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
from
snowflake_sample_data.information_schema.columns c
where
true
and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
and table_schema = 'TPCDS_SF10TCL'
and table_name = 'CALL_CENTER'
which as we can see has generated all of the queries we were hoping for:
Just to confirm, let’s pull out and format just one of these queries for a sanity check. Remember, this is just one such query of potential thousands that we automatically generated!
select
'CALL_CENTER' as table_name,
'CC_GMT_OFFSET' as column_name,
cast(min(CC_GMT_OFFSET) as string) as min,
cast(max(CC_GMT_OFFSET) as string) as max,
round(sum(cast((CC_GMT_OFFSET is null) as int)) / count(*) * 100.0, 2) as sparsity_perc,
count(distinct CC_GMT_OFFSET) as num_distinct_vals,
count(distinct CC_GMT_OFFSET) / count(*) > .98 as candidate_key,
30 as ordinal_position,
'' as comments
from
snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
Very nice! We’ve got all of the additional metrics and metadata we wanted. Just a few more quick comments on this query. For instance, the candidate_key flag could have been written with an explicit CASE statement:
case
when
count(distinct CC_GMT_OFFSET) / count(*) > .98
then
true
else
false
end as candidate_key
but we can dispense with the CASE statement altogether and just have the boolean expression evaluated directly (simpler code, probably better performance, slightly less obvious logic).
Now, remember, we also want to “stack” all of our profiling queries. Before modifying the code generation query, let’s just glance at an example of what we’re trying to achieve, now with two columns:
select 'CALL_CENTER' as table_name, 'CC_CALL_CENTER_SK' as column_name, cast(min(CC_CALL_CENTER_SK) as string) as min, cast(max(CC_CALL_CENTER_SK) as string) as max, round(sum(cast(CC_CALL_CENTER_SK is null as int)) / count(*) * 100.0, 2) as sparsity_perc, count(distinct CC_CALL_CENTER_SK) as num_distinct_vals, count(distinct CC_CALL_CENTER_SK) / count(*) > .98 as candidate_key, 1 as ordinal_position, '' as comments from snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
union all
select 'CALL_CENTER' as table_name, 'CC_CALL_CENTER_ID' as column_name, cast(min(CC_CALL_CENTER_ID) as string) as min, cast(max(CC_CALL_CENTER_ID) as string) as max, round(sum(cast(CC_CALL_CENTER_ID is null as int)) / count(*) * 100.0, 2) as sparsity_perc, count(distinct CC_CALL_CENTER_ID) as num_distinct_vals, count(distinct CC_CALL_CENTER_ID) / count(*) > .98 as candidate_key, 2 as ordinal_position, '' as comments from snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
This is quite straightforward: it’s just two instances of our same query, but for two different columns in the same table, stacked with the UNION ALL set operator (technically UNION would’ve worked — I’ll argue for my preference for UNION ALL in this context some other day — but in short, better performance, and easier to debug).
Now, we already have a list of our profiling queries, so the question is — how do we combine them all together with this UNION ALL operator?
Go back to the introduction for this post and check out the example of using the LISTAGG() function for string aggregation. I told you it would eventually be needed. 😅 Keep in mind that the syntax can differ a fair bit in other systems (some call it STRING_AGG for example, some include the ORDER BY within the aggregation itself rather than this odd WITHIN GROUP syntax).
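For a rough side-by-side (hedged, as exact syntax can vary by version), the same string aggregation looks like this in a few common dialects:

```sql
-- Snowflake / Oracle: LISTAGG with a WITHIN GROUP clause
select listagg(query, ' union all ') within group (order by ordinal_position)
from profiling_queries;

-- SQL Server: renamed to STRING_AGG, but it keeps the WITHIN GROUP clause
select string_agg(query, ' union all ') within group (order by ordinal_position)
from profiling_queries;

-- Postgres: STRING_AGG with the ORDER BY inside the aggregate itself
select string_agg(query, ' union all ' order by ordinal_position)
from profiling_queries;
```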
Also, as mentioned, you’ll need to familiarize yourself with CTEs if you haven’t already.
Ok, here’s the fully baked meta-driven code generation data profiling query (well, one such example):
with
profiling_queries as
(
select
'select '
|| ' ''' || c.table_name || ''' as table_name'
|| ', ''' || c.column_name || ''' as column_name'
|| ', cast(min(' || c.column_name || ') as string) as min'
|| ', cast(max(' || c.column_name || ') as string) as max'
|| ', round(sum(cast(' || c.column_name || ' is null as int)) / count(*) * 100.0, 2) as sparsity_perc'
|| ', count(distinct ' || c.column_name || ') as num_distinct_vals'
|| ', count(distinct ' || c.column_name || ') / count(*) > .98 as candidate_key'
|| ', ' || c.ordinal_position || ' as ordinal_position'
|| ', ''' || coalesce(c.comment, '') || ''' as comments'
|| ' from '
|| 'snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
,c.table_catalog
,c.table_schema
,c.table_name
,c.ordinal_position as ordinal_position
from
snowflake_sample_data.information_schema.columns c
where
true
and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
and table_schema = 'TPCDS_SF10TCL'
and table_name = 'CALL_CENTER'
)
select
listagg(query, ' union all ') within group (order by ordinal_position) as all_profiling_queries
from
profiling_queries
Now, run that query, copy the result into another SQL editor tab, and you’ll see this mess of code, which at first glance might be a bit frightening!
Don’t let it frighten you. Instead, bask in the glory of your creation...
…by running it! Your result should look something like this:
Now we’ve got something we can really work with!
Remember, we’ve got a follow-up meeting with the Finance leader by the end of week, during which we can review any questions we’ve come up with. We want to be comprehensive but also targeted with our questions, and the profiling query above gives us a lot of insight from which we can put together meaningful questions, such as:
So, in a matter of minutes, you’ve now got a fairly in-depth understanding of one of the tables in the source system, giving you a sense of what kind of ETL logic you’ll need to start implementing, as well as the questions you’ll need to review with your project sponsor, i.e. the Finance leader.
Very cool!
Is it time for a beer yet? It’s 5 o’ clock somewhere…
There’s something rather important that I really need to emphasize. And I do actually mean this. This is very important.
If you’ve been paying attention, the table we’ve been analyzing, CALL_CENTER, is all of 54 records. This tutorial was set up with this table intentionally, as it’s not going to cause any particular spike in credit burn.
However, we’ve just learned how to generate a rather massive query, and how to do so incredibly quickly and easily. If we’re not cautious, we could just as easily run a similar query on STORE_SALES, which clocks in at 28 billion records. And if we’re really not cautious, we could quickly find ourselves querying every column, of every row, of every table, unbounded, across billions and billions of records.
Don’t be that Data Engineer.
It’s your responsibility to be careful with how you implement the approach in this blog post, which by and large requires some kind of filtering logic to ensure you’re not scanning ridiculous amounts of data. Here are a few suggestions on how to do so, but you’ll need to make the right determination depending on your data and your use cases.
1. Limit the number of records scanned with a simple CTE:
with cte_1000 as (select * from STORE_SALES limit 1000)
2. Take a similar approach with Snowflake’s SAMPLE capability (I highly suggest reading the docs in detail):
with cte_1000 as (select * from STORE_SALES sample (1000 rows))
3. Figure out filter criteria that make sense for your purposes:
with cte_1000 as (select * from STORE_SALES where SS_SOLD_DATE_SK = 2452570)
Keep in mind that whichever approach you take, filtering/sampling/limiting your data is going to introduce some kind of skew and/or limitations in the results of your data profiling queries, so it’s important to understand what that impact will be so that you can still meaningfully leverage the results.
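To make the row-limiting concrete, one hedged way to wire it into the generator is to wrap each table in a row-limited CTE and have the generated query profile the CTE instead of the raw table (a sketch only; adjust the limit and the metrics to taste):

```sql
select
    'with sample_data as (select * from snowflake_sample_data.tpcds_sf10tcl.'
    || c.table_name || ' limit 1000) '
    || 'select count(distinct ' || c.column_name || ') as num_distinct_vals '
    || 'from sample_data' as query
from
    snowflake_sample_data.information_schema.columns c
where
    true
    and c.table_schema = 'TPCDS_SF10TCL'
    and c.table_name   = 'STORE_SALES'
order by
    c.ordinal_position
```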
In my case, I rather lazily added logic that only generates data profiling queries for tables with less than 10 million records. With that and a few other bits of logic, here’s the “final” query:
with
smaller_tables as
(
-- include table metric to prioritize analysis
select
table_catalog
,table_schema
,table_name
,row_count
,bytes
,comment
from
snowflake_sample_data.information_schema.tables
where
true
and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
and table_schema = 'TPCDS_SF10TCL'
and table_type = 'BASE TABLE'
and row_count <= 10000000
)
,profiling_queries as
(
select
'select '
|| ' ''' || c.table_name || ''' as table_name'
|| ', ''' || c.column_name || ''' as column_name'
|| ', cast(min(' || c.column_name || ') as string) as min'
|| ', cast(max(' || c.column_name || ') as string) as max'
|| ', round(sum(cast(' || c.column_name || ' is null as int)) / count(*) * 100.0, 2) as sparsity_perc'
|| ', count(distinct ' || c.column_name || ') as num_distinct_vals'
|| ', count(distinct ' || c.column_name || ') / count(*) > .98 as candidate_key'
|| ', ' || c.ordinal_position || ' as ordinal_position'
|| ', ''' || coalesce(c.comment, '') || ''' as comments'
|| ' from '
|| 'snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
,c.table_catalog
,c.table_schema
,c.table_name
,c.ordinal_position as ordinal_position
,st.row_count
,st.bytes
,coalesce(st.comment, '') as comment
from
snowflake_sample_data.information_schema.columns c
inner join
smaller_tables st on
true
and c.table_catalog = st.table_catalog
and c.table_schema = st.table_schema
and c.table_name = st.table_name
)
select
table_name,
comment,
row_count,
bytes,
count(*) as num_columns,
listagg(query, ' union all ') within group (order by ordinal_position) as all_profiling_queries
from
profiling_queries
group by
1, 2, 3, 4
order by
table_name
Let’s briefly review the results, as there are a few more things to discuss:
First of all, you can see how the query was slightly modified to now generate a data profiling query for each and every table in your schema. (Nothing would stop you from expanding even further to every schema in your particular database, but that’s probably too much for a single analysis.)
I’ve included metrics that might help prioritize which tables to look into first.
You’ll also see a handful of columns in the data profiling results that help you organize your analysis of the, uh, columns… that’s slightly confusing. Let me rephrase. Let’s look at the data profiling results again for a particular table:
Now that you’ve got your profiling results, you can review them by sorting your results a few different ways:
So, not only do you have detailed “data profiles” for all of the tables/columns in your database, you also have a number of different ways to slice/dice and sort your results. Why is this important? Well, to be honest, a fair amount of the knowledge and skill of Senior-level Data Engineers is obscure/esoteric/tribal, and much of it just develops naturally and implicitly. In this particular case, I’ve come to find that more advanced Data Engineers build an intuitive understanding of the data incredibly quickly (which I call “data fluency”), while more junior engineers have to reference the ERD or the data dictionary, and/or run queries far more often to confirm something about the system. What’s really happening is that more advanced Data Engineers establish more connections in their brains to what the data represents. Again, consider two approaches:
The second approach is going to cement much more information in your brain about the data, and you’ll quickly become “fluent” with the data, quickly memorizing the important tables/columns, a sense of what each represents, what the edge cases are, what the formatting and data quality issues are, etc. This is a skill that is difficult to teach in a deterministic fashion, but I think this blog post overall gives a decent sense of how to start acquiring this skill.
Is it time for another beer or should we bring espresso into the mix? Espresso martini perhaps? ☕️🍸
Let me point out a few ways you might want to improve this data profiling query. By no means is this list comprehensive. I’m simply trying to show you a few more places to find more fish along with some better bait to use.
1. Exclude metrics that don’t make sense for certain data types. For example, skip distinct counts on timestamps (note the empty string in place of NULL, since concatenating NULL would null out the entire generated string):
|| case when c.data_type = 'TIMESTAMP' then '' else ', count(distinct ' || c.column_name || ') as num_distinct_vals' end
In fact, you’ll likely want to do the same thing for decimal data that you might find in a FLOAT or NUMERIC field, and while you’re at it — you’ll likely want to exclude any BLOB/CLOB fields, VARIANT fields, ARRAY fields… plus more. You get the gist. You’re a Data Engineer. You can figure it out. 😉
2. Add more conditional logic specific to various data types. For example:
3. Add more complex logic for your use case.
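As one hedged sketch of what that data-type-specific logic might look like (the DATA_TYPE values shown are what Snowflake’s INFORMATION_SCHEMA reports; other databases will differ):

```sql
select
    'select '
    -- min/max only for types where ordering is meaningful
    || case
         when c.data_type in ('NUMBER', 'FLOAT', 'DATE')
             then 'cast(min(' || c.column_name || ') as string) as min'
                  || ', cast(max(' || c.column_name || ') as string) as max'
         else 'null as min, null as max'
       end
    -- skip distinct counts for types where they're rarely useful
    || case
         when c.data_type in ('TIMESTAMP_NTZ', 'VARIANT', 'ARRAY', 'OBJECT')
             then ', null as num_distinct_vals'
         else ', count(distinct ' || c.column_name || ') as num_distinct_vals'
       end
    || ' from snowflake_sample_data.tpcds_sf10tcl.' || c.table_name as query
from
    snowflake_sample_data.information_schema.columns c
where
    true
    and c.table_schema = 'TPCDS_SF10TCL'
    and c.table_name   = 'CALL_CENTER'
```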
Otherwise, though, you’ve now got everything you need to put this very powerful approach of metadata-driven code generation to use in your data profiling analysis, helping you quickly reverse engineer a source system and increase your data fluency as you build out analytics solutions for your customer!
Metadata-driven Code Generation (Part 2)
But wait, there’s more!
Well, kind of.
One more query for you to take with you on your data profiling journey. This time, no more fine print — just the query itself, followed by a few comments.
select
table_name
,'select '
|| column_name
|| ', count(*) as record_count'
|| ' from ' || table_catalog || '.' || table_schema || '.' || table_name || ' group by 1 order by 2 desc;' as query
from
snowflake_sample_data.information_schema.columns
where
true
and table_catalog = 'SNOWFLAKE_SAMPLE_DATA'
and table_schema = 'TPCDS_SF10TCL'
order by
table_name,
ordinal_position;
Again, let’s just look at one of the resulting generated queries.
The UI here is DBeaver on my MacBook, which I’ll explain shortly (i.e. why I’m not using the Snowflake UI).
This is what’s referred to as a “frequency distribution” query, and as you can tell, it just tells you how many records you’ll find for each distinct value of a particular column.
The main point of this kind of query really is just to get that much more familiar with the data, get a sanity check on its quality, and better inform your expectations of what the data represents as you prepare to engage with stakeholders or sponsors to flesh out requirements and understand the system better (i.e. as you prepare to meet with the Finance leader at the end of the week in the contrived example above).
So, in this case, you can pretty easily see that there’s a gap in the data, i.e. there’s no value of 1999-01-01, which I would expect looking at the other values. So this is certainly something I’d want to ask the project sponsor about.
Secondly, you can also see that there are a lot more records in 1998 than any other year. That might be entirely expected, but it’s another example of a good/meaningful question that’s going to demonstrate to the sponsor that you’re doing your homework and not wasting his/her time with trivial questions that you could answer yourself.
Lastly, these are insights that you wouldn’t glean from either a simple “Data Preview” nor even the statistical profiling queries generated above in this post. Hence, I would definitely recommend including frequency distribution queries in your data profiling toolbox.
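One hedged tweak I’d suggest for the generated frequency distribution queries themselves: cap the output and add a share-of-total column, so high-cardinality columns don’t flood your result set. For a single column, that could look like this:

```sql
select
    CC_CLASS
    ,count(*) as record_count
    -- window aggregate over the grouped result gives the share of total
    ,round(count(*) / sum(count(*)) over () * 100.0, 2) as pct_of_total
from
    snowflake_sample_data.tpcds_sf10tcl.CALL_CENTER
group by
    1
order by
    2 desc
limit 20
```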
Real quick — you’ll note that in this case, I didn’t group all of the queries together in one big massive query daisy-chained with UNION ALL, like I did previously. Why?
Well, for one, I really want to investigate one column at a time, so I’d have to copy the results of such a massive query into Excel and then filter one column at a time (or write a similar SQL filter). This doesn’t save me any time compared to just flipping through result tabs in any modern SQL UI, so there’s no real value in daisy-chaining the queries together.
Secondly, it’s possible that some of these queries can produce a pretty massive result set, i.e. for those columns with high cardinality (i.e. high number of distinct values). So, your analysis for other columns can just get buried when the number of results exceeds your UI’s result set limit (usually 1000 or so). So, again, no value in daisy-chaining the results.
Unfortunately, Snowflake’s new Snowsight UI doesn’t support displaying multiple result sets after running multiple queries (it will happily run them all, but only shows you one result, which makes no sense for read-only queries). Hence, I downloaded DBeaver to accomplish this. There are a million other UIs you could use; I just picked one randomly.
But as you can see below, after running my data profiling query, I was able to copy the results, i.e. all of the subsequent frequency distribution queries, into DBeaver and run all of them, with multiple result sets available in the bottom of the screen (in the case of DBeaver, you just have to run “Execute SQL Script” rather than “Execute SQL Query”).
Now, all the same caveats as before, including:
Time for a night cap…
Final Thoughts
Personas
This tutorial really is based on a real-world project as described. The project sponsor, i.e. the one funding the project, was also the one with the tribal knowledge of the SQL Server instance in scope that I went back to with questions.
However, that’s probably the only project I’ve been on where the sponsor and the technical champion were the same person. As such, don’t read too much into my cavalier use of terms in this article. In real-life consulting and sales engineering engagements, it’s obviously critical to understand who your economic buyer, sponsor, champion, stakeholders, end users, etc. are. Fleshing out all the different roles, responsibilities, org charts, and RACI charts of a data engineering project is well outside the scope of this article, so just take that aspect of this post with a grain of salt.
Data Technology
It should go without saying that the analysis described in this post is going to be much more efficient in Snowflake (or any column store database) than SQL Server (or any row store database). With ELT being the more common pattern of the day in most cloud data engineering projects, I’ll state the obvious that you should plan on conducting most of your data profiling in your target system, after extracting raw data from your source systems. This would also be the case with your data lake architecture for any inefficient storage formats, such as CSV or JSON files. However, if you’re querying columnar formats from your lake, such as Parquet or Iceberg, you’re obviously going to get a lot more mileage.
One more comment, to slightly contradict myself. If you have an on-premises source system and data volumes that aren’t absurdly huge, and especially if your source system is some kind of analytical database, i.e. something with column storage, decent hardware, etc., then you might be in a position where it makes sense to batch up a lot of expensive data profiling queries overnight. You’ll obviously spend far less on the electricity running your box under someone’s desk than you would burning through the equivalent credits in your cloud system. This case is much more the exception than the rule, but still worth keeping in mind.
Code Generation
One of the most clever Data Architects I ever hired brought a fully-fledged ETL framework to the table when he came on board. He had a very similar approach: leveraging metadata to auto-generate a significant amount of his ETL code, along with a very robust error-handling and logging framework. Some day I might ask him if it’s something he’s willing to share, but what I’ve tried to emphasize is that it’s often much more about teaching someone to fish than giving them a fish. Thus, I would encourage you to take the skills learned in this post and think through how you can extend them to other data engineering needs such as data observability and ETL.
COTS Software
When I first introduced the term “data profiling” (and the same goes for any other technical term in this post or any others I write), if you hadn’t previously heard of it, did you look it up? Here’s the Wikipedia article. One of my pet peeves on LinkedIn is so-called “influencers” publishing well-understood patterns/solutions from data engineering as if they had invented them. So, I obviously did not invent data profiling, but it is true that it’s rare to find this level of analysis among the skills of most Data Engineers. So, I do truly hope you found it valuable.
I also want to call out that data profiling is typically in the purview of the discipline of data quality (these days called “data observability”). Legacy tools like Informatica and BusinessObjects Information Steward, as well as more modern data observability solutions like Monte Carlo and Anomalo, all come with out-of-the-box data profiling capabilities, as it’s critical to profile source systems to understand what kind of data cleansing logic you need to apply before loading the consumption layer of your data warehouse.
So, if these capabilities are available “out of the box”, why bother with all this hand-coded SQL?
Well, the most important reason, I would say, is that there are many data analytics projects that simply don’t have the budget for a fully-fledged end-to-end enterprise data warehouse with all the bells and whistles, including data quality / data observability tools. So, it’s that much more important to share these kinds of skills with the market so that Data Engineers can still be highly effective under limited budgets.
Secondly, my experience in life really has been that you’re more effective if you can show all your work (as opposed to relying on calculators or software), not less. It’s just like in high school. The kid who could estimate sine and cosine values of any given angle without a scientific calculator (I can’t, and I have a math degree) had a much better grasp of the concepts, learned new concepts much quicker, and made far fewer mistakes. So, too, with data engineering. The more you know how to do “the hard way”, the better you’ll be even when executing your work “the easy way”.
Lastly — personal preference. Many folks in IT do as much as they can via command line rather than GUI-driven workflows. I find myself constantly tweaking my queries, and I’m often much faster and more efficient doing so just in a SQL console rather than fiddling with the various dashboarding features in a data profiling tool.
As you can tell, this is my first blog post in a very long time detailing rather technical content. As such, please forgive any rough edges, points of confusion, etc. I really enjoyed writing this article, and I truly would value any/all feedback! Do you see a way to improve this post? Whether it’s removing confusion, adding clarity, adding additional concepts, or if you simply want to share whether you enjoyed (or hated it), let me know! A DM on LinkedIn or an email to jody@heschconsulting.com would be fantastic. Thanks so much!