What is the future for Data Migration ?

What is the future for Data Migration ?

Over the past couple of months, I must admit I have been thinking increasingly that there must be better frameworks or toolsets than SSIS for performing Data Migration efficiently, robustly, and repeatedly. I do agree that SSIS is a powerful and versatile tool for performing Extract, Transform and Load workflows on disparate data sources, and there a plethora of libraries that can be added to SSIS to enable interaction with the wide range of formats that are out there.

Most recently, I have been spending a great deal of time developing and extending SSIS packages and can see that this is a powerful tool to have in my coding toolbelt. More and more, I have been reading about Python can be used with more freedom to perform most and if not all ETL needs. When you combine Python with two modules specifically made for manipulating and analyzing data at scale, these are Dask and Pandas.

Python has a great advantage in that is free and open-source, and already has an extensive number of packages that are free to use, and as a Data Engineer, you can easily add to the underlying source code if you come across a bug or see a feature. Python is able to talk to many data sources using other open-source packages; from CSV files to Kafka to scraping websites.

SSIS, conversely, requires you to license any machine running it just as you would license any other machine running a full instance of SQL Server. So when thinking about scalability and offloading your ETL processing to other nodes, you will incur more licensing costs. As you can imagine, the costs can scale up very quickly.

Through using tools like Dask, Python is horizontally scalable. Dask is specifically designed to handle the issues of datasets that are too large to fit in memory on a single node and is scalable across many nodes. Therefore, you can scale the size of your data processing environment as needed by your projects, making use of existing tools like Kubernetes, without the need to write complex code to distribute data across nodes.

What appeals to me the most from my learning is that Python is inherently testable. Python has several useful unit testing frameworks, such as unittest or Pytest. By my understanding, breaking up your ETL processes into consumable (White Box Components) units of code, greatly increases your ability to ensure expected behavior and make changes without fear of inadvertently breaking functionality.

In my opinion, Python is a better choice for ETL workflows, however, this does not take away from the fact that SSIS is extremely powerful, and with robust existing teams supporting it and investing heavily in SQL Server, then there is no need to change all of this. However, I know that I will be investing more time in Python to see how it can help and stretch me going forward.

Regarding the length of time needed to get fully skilled in Python, this is higher than in SSIS. However, SSIS has many drag and drop components that can be used straight away, to get the most functionality out of SSIS, the addition of scripting is necessary whether through VB.net or C#.net, or other languages. There are other technologies/frameworks for Data Migration such as Talend, which I am keen to learn about in equal measure. 

Cheers,

Kwasi Date-Bah

Updated Date: 11/06/2021

Been reading a very interesting article on medium.com concerning "Building Data Platforms - The ETL Bias" The article link is here: https://medium.com/codex/building-data-platforms-the-etl-bias-d589733ce4cc. This article is a very interesting read and clearly explains what needs to be done, to make ETL and data more centralised and core to a business.

Like
Reply

There are a lot of ETL solutions out there... personally, Talend gives me the ability to combine my Java coding with the build so is my preferred choice. However (like you) I've also been considering Python but this is to introduce AI into the solution.

Like
Reply

To view or add a comment, sign in

More articles by Kwasi Date-Bah

  • Smarter Approaches to ETL

    I am constantly learning and one website that I have recently come across is medium.com, which contains a lot of useful…

  • Planning a CRM migration

    I classify myself as a Data Migration Developer with extensive experience primarily within the Charity sector. I have…

    2 Comments

Others also viewed

Explore content categories