Data Transformation Tools: Application or Code?


There seem to be two camps in the data transformation world:

  • Those who use applications (Informatica, Talend, Matillion, Alteryx, etc.)

  • Those who write code (Python, PL/SQL, T-SQL, etc.)

I will admit my bias up-front and state that I sit in the “applications” camp. I can understand why someone might code an ad-hoc data transformation to produce a single dataset, if they happen to already know a suitable coding language. But I’ve always struggled to understand why anyone would choose to code an ETL/ELT solution (of any significant size, and which other people will need to develop and support) rather than use a dedicated application.

ETL/ELT applications were first developed 10-20 years ago to move people away from having to write code – so why has there been a move back to writing code in recent years?

Can anyone persuade me that writing code is a better solution than using a dedicated application – at least for a production solution of any significant size?

I have laid out a few of the pros and cons of applications as I see them. To keep these lists short, I’ve limited them to issues where I feel the reverse applies to a coding approach, i.e. a “pro” in this list is a “con” for coding, and a “con” in this list is a “pro” for coding.

Pros

  • Applications are quicker to learn

- Give a reasonably technical person a data transformation problem and ask them to solve it using both Matillion and Python (neither of which they already know), and I’d guarantee they would take less time to complete the task in Matillion.

  • Applications are easier to understand (and therefore support)

- Show someone a (non-trivial) application job/workflow, then the same logic expressed in code, and it will take them less time to understand the process by looking at the application than by looking at the code.

  • Applications have at least the potential to be easily integrated into a wider data governance ecosystem

- If you change the logic used to calculate a field value, you can tell, for example, which of your Tableau reports might be impacted by this change

- You can display the data quality score for a field as it traverses a multi-step pipeline

- You can see, immediately, where you are holding and processing PII data (that you have tagged as such in your Data Glossary)

Cons

  • Applications cost money (and often significant amounts of money); most coding languages are free

- Obviously, the up-front cost favours coding; I have no idea if this advantage continues when you consider the TCO for developing and supporting a solution – does anyone know if there has been any reliable (i.e. independent) research into this?

  • Applications probably execute more slowly than code

- I would expect well-written code to execute faster than a well-written application; even if the actual code in both solutions executes at the same speed, there is always an additional overhead when running an application

  • Applications do not integrate easily (or at all) into a CI/CD workflow

- With a pure code solution, using CI/CD is simple. However, applications tend to have their own way of working, and even when they can, technically, be integrated into a CI/CD workflow, they tend to be “black boxes” rather than fully managed by your CI/CD components.
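To illustrate why the code camp finds CI/CD so natural: a pure-code transformation is just a function that can be unit-tested, versioned, and run by any standard test runner on every commit. A minimal sketch in Python (the function name, data shape, and test are hypothetical examples, not taken from any particular tool):

```python
def dedupe_customers(rows):
    """Keep only the most recent record per customer_id.

    rows: list of dicts, each with 'customer_id' and an ISO-format
    'updated_at' string (so plain string comparison orders them correctly).
    """
    latest = {}
    for row in rows:
        key = row["customer_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())


# Because the transformation is plain code, a standard test runner
# (e.g. pytest) can verify it on every commit - which is exactly the
# property that makes CI/CD integration straightforward.
def test_dedupe_keeps_latest_record():
    rows = [
        {"customer_id": 1, "updated_at": "2023-01-01", "email": "old@example.com"},
        {"customer_id": 1, "updated_at": "2023-06-01", "email": "new@example.com"},
        {"customer_id": 2, "updated_at": "2023-03-15", "email": "other@example.com"},
    ]
    result = dedupe_customers(rows)
    assert len(result) == 2
    assert next(r for r in result if r["customer_id"] == 1)["email"] == "new@example.com"
```

Nothing here is specific to a CI system: the same test runs identically on a laptop and in a pipeline, which is the “simple” integration the code camp points to. Whether an application-built job can offer an equivalent automated check depends on the vendor.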

