Thinking out of the box - an example
About two years ago, I worked on a project where we had to migrate an enterprise application from an MPP to Spark with Delta.io on an on-prem Hadoop environment.
We had around 17,000 SQL statements in 4,000 files, which were executed as per a schedule. The SQLs were run on the MPP using a custom database driver. Yup. The application had grown that way over a period of 20 years.
When we started, we had all the SQL statements, which had to be converted to be compatible with Spark SQL syntax. The scheduler was to be replaced by Airflow. As always, the deadline was pretty aggressive. But that is not the story.
To execute the SQLs on the target platform, we had to build a Python framework. Using suitable metadata, we generated the PySpark applications. The generator read the SQLs from an Excel file and stored them in the required structure. Each program was independent: everything it needed for execution was available in the same file, and each executed one or more SQL statements using SparkSQL. Before we executed the programs in a schedule, we had to ensure that each application worked as expected, so the team submitted each application and checked its results. The SQL queries and the Python framework were modified as needed in parallel.

A downside of this independence was that the same methods were repeated in every application. Which methods? One method executed an SQL statement, took care of exceptions and returned only True / False depending on the outcome. Another fetched parameter values and populated them in the SQL statement before executing it. These two methods were repeated in each program.
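To make the two repeated helpers concrete, here is a minimal sketch of what each generated script carried. This is a hypothetical reconstruction, not the project's actual code; the names `run_sql` and `resolve_params` and the `${NAME}` placeholder convention are my assumptions.

```python
def run_sql(spark, sql):
    """Execute a statement via SparkSQL, swallow exceptions,
    and return only True / False depending on the outcome.
    (Hypothetical reconstruction of the helper repeated per script.)"""
    try:
        spark.sql(sql)
        return True
    except Exception as exc:
        print(f"SQL execution failed: {exc}")
        return False


def resolve_params(sql, params):
    """Populate parameter values in the SQL statement before execution.
    Assumes placeholders of the form ${NAME} (an illustrative convention)."""
    for name, value in params.items():
        sql = sql.replace("${" + name + "}", str(value))
    return sql
```

Because every generated application embedded copies of these helpers, any fix to them had to be repeated across thousands of files, which is what made consolidation attractive later.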
You may ask why not store the SQLs in a database and have the framework pull the queries from it. Four reasons, the first being time: each connection to the database takes additional time. The second was SQL modification. As the SQLs were within the Python file and each file was independent, each developer could edit her file, make changes and submit the application without affecting any other script. Even so, we had cases where people overwrote their own files, and in some cases we lost a few files. The third reason was the inability of a database cell to serve as a proper editor. The team would have to copy-paste the query into an editor like Notepad++ (I know. People are not comfortable using vi), edit it, and then put it back into the proper cell. And here 'proper' is important: given the typical database editors we use, it is quite easy to overwrite another cell. The fourth was the number of people editing the database simultaneously. Imagine 40 people editing the same table. It would be a classic horror story.
When we started executing the PySpark applications through Airflow, we found that initialization of the Spark session took significant time. If we could combine the execution of multiple PySpark applications, we could save some of it. Does that not sound like a wonderful idea? 'Combine multiple PySpark applications into one application'. The question was, how were we to do it with minimal impact on the delivery timeline? Even if we took the time to train and guide each team member on the changes needed, we were looking at significant effort. Combining multiple PySpark applications was not trivial. We would have to edit each file, remove the common methods and move the code present in the 'main' into a method with the required parameters. Then we would have to write another program that imported the modified scripts and executed them in the proper order.
It was quite a monumental task, given the number of team members and number of files involved.
I did the next best thing - I automated the process.
I wrote a Python program to scan each script, extract the SQL statements and upload them in the proper order into an SQLite database. Why not Excel? Some SQLs were so large that they exceeded the character limit of a cell in Excel, so I had to use a database. A remote database server was ruled out because the connections added heavily to the time taken for application generation: with Excel or SQLite, code generation took around 15 minutes; with a remote database, the application needed two hours. So a local database like SQLite was the best option. Because the PySpark applications were generated by a program, the structure and naming of methods and variables were well defined and consistent, which is what made automated extraction possible.
After extracting all the SQLs into a database table, I had the liberty of redesigning the architecture, making use of common classes and common methods. The new version of the application generator created the scripts without a 'main' and also generated a driver program that imported those scripts and executed them in the required order. All script names and their order were managed using suitable metadata.
I was able to successfully extract SQLs from 95% of the PySpark applications, and the extraction itself took only around 10 to 15 minutes to process around 4,000 scripts. I ended up saving the team many days of mundane work. For all practical purposes, the team continued to use the helper shell scripts for application execution, though the helper scripts also had to be changed.
The only downside is that the transition went with so few hiccups that the effort was not appreciated.
How much time did we save by changing the architecture? We chopped around four hours from a schedule that was earlier running for around 24 hours. Obviously, much more optimization was to follow, because the schedule was supposed to complete in eight hours and hence execute three times a day.
What can we learn from this exercise?