(Azure) Databricks: accelerating big data analytics with the Spark connector for Azure SQL

(Azure) Databricks, with Spark behind the scenes, is perfect for dealing with the massive amounts of data that you may have, for example, in your Data Lake.

The use of Spark allows great optimization of your processes by distributing the whole computation across clusters of machines, which can be sized and scaled according to your needs.

Despite this, execution times can still be quite long, especially if you need to process and save large amounts of data to an (Azure) SQL database.

In this short post I will show you how slow inserting a lot of data from Databricks into an Azure SQL database can be, and how you can speed this operation up.

Scenario

Let's imagine you retrieved a large amount of data from the web in a stage of your (Azure) Databricks pipeline and saved it into an Azure Data Lake folder. After that, you process the data by transforming it into the desired shape and store it in an Azure SQL database to make it available for further analysis.
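As a sketch, the ingestion side of that pipeline could look like the following; the storage account, container, and folder names here are purely hypothetical placeholders, and `spark` is the session that Databricks provides:

```python
# Hypothetical Data Lake location (account, container and folder are placeholders).
account = "mydatalake"
container = "raw"
path = f"abfss://{container}@{account}.dfs.core.windows.net/web-data/"

# On a Databricks cluster you would then read the staged files, e.g.:
# df = spark.read.parquet(path)
```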

The following is the architecture of the solution that we will discuss in this post:


Of course, persisting the data to the SQL database can easily be achieved using the built-in JDBC Spark SQL connector but, unfortunately, it becomes a bottleneck when you want to read and write large amounts of data.

In the specific example we will discuss, the folder contains files totaling about 139,120,300 rows (7 columns), while the cluster configuration is as follows: driver (Standard_F4s): 8.0 GB memory, 4 cores, 0.5 DBU; workers (1-4, Standard_F4s): 8.0 GB memory, 4 cores, 0.5 DBU. The Azure SQL database is an S0 tier.

Using JDBC Spark connector

The JDBC connector gives the ability to connect to several relational databases. However, it is not optimized for data loading, and this can substantially affect throughput. In fact, as shown in this article https://techcommunity.microsoft.com/t5/datacat/turbo-boost-data-loads-from-spark-using-sql-spark-connector/ba-p/305523, the standard connector performs one insert statement for each row of the dataframe you want to write.
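For comparison, a minimal sketch of the baseline write path with the generic connector; `url` and `srctable` here are placeholders for your connection string and target table:

```python
def write_with_jdbc(df, url, table):
    """Append `df` to `table` via the built-in JDBC connector.
    Each row ends up as an individual INSERT statement on the server."""
    (df.write
       .format("jdbc")
       .mode("append")
       .option("url", url)
       .option("dbtable", table)
       .save())
```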

We conducted some tests loading only part of the source data (667,648 rows): the average execution time was about 12 minutes which, projected over the total number of rows to be loaded (139,120,300), is equivalent to about 42 hours. A very long execution time!
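The projection is simple proportional scaling; as a quick sanity check of the arithmetic, using the measurements above:

```python
sample_rows = 667_648       # rows loaded in the test
sample_minutes = 12         # average execution time for the sample
total_rows = 139_120_300    # full dataset

# Scale the sample timing linearly to the full dataset.
projected_hours = total_rows / sample_rows * sample_minutes / 60
print(round(projected_hours, 1))  # about 41.7, i.e. roughly 42 hours
```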

Using SQL Spark connector

Looking for a solution to this problem, we came across this very interesting post:

It talks about a new open-source Spark connector for SQL Server and Azure SQL which "is up to 15x faster than generic JDBC connector for writing data to SQL Server". Moreover, unlike the standard connector, the new connector offers direct bindings for Python.

Integrating the new connector into your application is quite simple since it exposes an interface very similar to the built-in one. In fact, a couple of lines of code are enough to make it work (Python):

(df.write.format("com.microsoft.sqlserver.jdbc.spark")
        .mode("append")
        .option("batchsize", "500000")
        .option("tablelock", "true")
        .option("url", url)
        .option("dbtable", srctable)
        .save())

Note the format that is specified in the above code.

The most "challenging" part is adding the connector as a library. Indeed, since it is really very new (22 June 2020), it is not yet available via Maven or pip. You can either download the pre-built jar or (only the brave) clone the source code available on GitHub and build it to create your own jar. Once this is done, the jar can be loaded directly as a library into the Databricks cluster:


The first test with the new connector was to load 25,000,000 rows with the default batch size (1,000 rows per batch). The execution time was about 180 minutes which, projected over 139,120,300 rows, is equivalent to about 16.69 hours: a good gain compared with the previous connector.

But we still wanted to improve the performance, so we tuned the batch size and ran several tests:

  • Batch size of 100,000 and 1,000,000 rows -> the execution time was 4.5 minutes (would be 10.43 hours for all the rows);
  • Batch size of 500,000 and the whole dataset (139,120,300 rows) -> the execution time was 5.88 hours.
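As a back-of-the-envelope comparison of the two full-dataset figures above (the 16.69 hours projected with the default batch size versus the 5.88 hours measured with a batch size of 500,000):

```python
total_rows = 139_120_300

hours_default_batch = 16.69  # batch size 1,000 (projected)
hours_large_batch = 5.88     # batch size 500,000 (measured)

rows_per_hour_default = total_rows / hours_default_batch
rows_per_hour_large = total_rows / hours_large_batch
speedup = hours_default_batch / hours_large_batch
print(f"Tuning the batch size alone gives about {speedup:.1f}x more throughput")
```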

This seems to be a good result, but you can achieve even better performance.

Scaling SQL - Achieve more

While the data loading was taking place, we ran the following query on the target Azure SQL database:

SELECT  t.text, r.total_elapsed_time, r.row_count, r.*
FROM    sys.dm_exec_requests r
        CROSS APPLY sys.dm_exec_sql_text(sql_handle) t
--WHERE command = 'BULK INSERT'
ORDER BY session_id

We got the following result:


As you can see, the report shows some LOG_RATE_GOVERNOR waits, a sign that you have to scale the database up to get even better performance. Thus, after scaling the SQL database to the S2 tier, the whole execution with a batch size of 500,000 became even faster: 3.48 hours, about 11x faster than the first version using the standard JDBC connector.

Conclusions

Writing large amounts of data from (Azure) Databricks to an Azure SQL database can be very slow despite the use of Spark. This article suggested one solution to drastically reduce the execution time of data loading using the new Spark to SQL connector.

Other tips:

  • For tables that are heaps, use the TABLOCK hint to allow parallel streams. This is particularly relevant for staging tables, which tend to be heaps;
  • For bulk loads into columnstore tables, do not use the TABLOCK hint, as that would serialize parallel streams.
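Those two tips can be folded into a small helper when building the connector options; this is just a sketch, mapping the advice above onto the tablelock option shown in the earlier code:

```python
def tablelock_option(target_is_columnstore: bool) -> str:
    """Choose the tablelock setting for the SQL Spark connector:
    heaps benefit from TABLOCK (parallel bulk load streams), while
    columnstore targets should avoid it (it would serialize the streams)."""
    return "false" if target_is_columnstore else "true"
```

For a heap staging table you would then pass .option("tablelock", tablelock_option(False)) when writing.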




