Best Practices for Azure Sql data warehouse Data Load using polybase or single-client gated load methods

*Best practices and considerations when using PolyBase

Here are a few more things to consider when using PolyBase for SQL Data Warehouse loads:

1) A single PolyBase load operation provides best performance.

2) The load performance scales as you increase DWUs.

 3) PolyBase automatically parallelizes the data load process, so you don’t need to explicitly break the input data into multiple sources and issue concurrent loads, unlike some traditional loading practices.

4)Multiple readers will not work against compressed text files (e.g. gzip). Only a single reader is used per compressed file since uncompressing the file in the buffer is single threaded. Alternatively, generate multiple compressed files. The number of files should be greater than or equal to the total number of readers.

  5)Multiple readers will work against compressed columnar/block format files (e.g. ORC, RC) since individual blocks are compressed independently.


*Best practices and considerations for single-client gated load methods

Consider the following when using SSIS, BCP, or other Control-node and client-gated loading methods:

1) Issue multiple threads into different tables to improve throughput. SQL DW does not support loading multiple threads into the same table since it requires exclusive lock. (This only applies to non-PolyBase load method).

  2) Include retry logic—very important for slower methods such as BCP, SSIS, and SQLBulkCopy.

  3)For SSIS, consider increasing the client/connection timeout from the default 30 seconds to 300 seconds. For more information about moving data to Azure, see SSIS for Azure and Hybrid Data Movement.

  4)Don’t specify the batch size with Control-node gated methods. The goal is to load all or nothing so that the retry logic will restart the load. If you designate a batch size and the load encounters failure (for example, network or database not available), you may need to add more logic to restart from the last successful commit.


To view or add a comment, sign in

More articles by Rahul Kumar

  • Why Multidimensional Scaling Fails?

    Why MDS Fails in giving us Meaningful Embeddings: MDS basically arranges points in 2D or lower dimension based on high…

  • Ensemble Learning

    Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem. In…

  • Radial basis function network

    If the classes or pattern are linearly separable then single layer Perceptron is sufficient otherwise we need to…

    4 Comments
  • Linear Discriminant Analysis

    Lets start with Limitations of Logistic Regression Logistic regression is a simple and powerful linear classification…

  • Resolving MERGE Performance in Azure SQL Database

    When merging large data-sets in Azure SQL Database its imperative to optimize our queries. Failure to do so will most…

    1 Comment
  • ORC vs RC file format

    ORC offers a number of features not available in RC files: * Better encoding of data. Integer values are run length…

  • Partitioning clustered columnstore tables in Azure Sql Data-warehouse

    Partitioning can be used to improve performance some scenarios, creating a table with too many partitions can hurt…

  • NoSql Database Modelling Challenges

    *Modelling Challenge: Related data to embed or reference related data?? related data: embedding Considerations: 1)data…

Others also viewed

Explore content categories