Optimize performance with Delta Lake

Delta Lake on Azure Databricks supports two layout algorithms: bin-packing and Z-Ordering.

Bin-packing: coalesces small files into larger ones to improve read performance. Run the OPTIMIZE command to trigger compaction.

Specify a partition predicate if you only want to optimize a subset of the data, for example the most recently loaded partitions.
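A minimal sketch of both forms (the events table and its date partition column are hypothetical):

    -- compact the whole table (bin-packing)
    OPTIMIZE events;

    -- compact only recent partitions
    OPTIMIZE events WHERE date >= '2024-01-01';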

Z-Ordering: co-locates related information in the same set of files. Delta Lake's data-skipping algorithms can then reduce the amount of data that needs to be read, because related values end up concentrated in fewer files.

When to use Z-Ordering

If a column is used frequently in query predicates and has high cardinality (many distinct values), it is a good candidate for ZORDER BY.
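A minimal sketch, again using the hypothetical events table and a high-cardinality eventId column:

    OPTIMIZE events
    WHERE date >= '2024-01-01'
    ZORDER BY (eventId);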

Delta table readers use snapshot isolation, which means they are not interrupted when OPTIMIZE removes files that are no longer needed from the transaction log. OPTIMIZE returns file statistics (min, max, total number of files) for the files added and removed, plus Z-Ordering statistics (number of output batches, partitions optimized).

Auto Optimize: can also compact files during individual writes to a Delta table.

It consists of two features: optimized writes and auto compaction.

Optimized writes: dynamically optimize the Spark partition size based on the actual data, attempting to write out files of about 128 MB for each table partition.

Auto compaction: after an individual write completes, checks whether files can be further compacted and, if so, runs a quick compaction on partitions with many small files.

To enable them for a table, set the following table properties in the CREATE TABLE command:

delta.autoOptimize.optimizeWrite = true

delta.autoOptimize.autoCompact = true
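A minimal sketch of such a CREATE TABLE (the schema and table name are hypothetical); the same properties can be applied to an existing table with ALTER TABLE ... SET TBLPROPERTIES:

    CREATE TABLE events (
      eventId BIGINT,
      date    DATE,
      payload STRING
    )
    USING DELTA
    PARTITIONED BY (date)
    TBLPROPERTIES (
      delta.autoOptimize.optimizeWrite = true,
      delta.autoOptimize.autoCompact = true
    );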

For tables larger than 10 TB, keep OPTIMIZE running on a schedule to further consolidate files and reduce the metadata of your Delta table.

Auto Optimize does not support Z-Ordering, so schedule OPTIMIZE ... ZORDER BY jobs to run periodically.

Auto compaction is triggered only when a directory contains at least 50 small files (the default threshold).

For custom behaviour, set the Spark configuration spark.databricks.delta.autoCompact.minNumFiles, as shown below.

delta.autoOptimize.autoCompact accepts the values true, false, legacy, and auto (recommended).
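A minimal sketch of both knobs (the threshold of 100 and the events table are illustrative choices, not defaults):

    -- session-level override of the small-file threshold
    SET spark.databricks.delta.autoCompact.minNumFiles = 100;

    -- enable auto compaction on an existing table with the recommended setting
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'auto');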

Auto compaction uses a default target file size of 32 MB; when delta.autoOptimize.autoCompact is set to legacy or true, it uses 128 MB as the target file size.
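If a different target is needed, the output file size of auto compaction can be overridden at the session level; a sketch, assuming the spark.databricks.delta.autoCompact.maxFileSize configuration is available in your runtime (64 MB is an illustrative value):

    -- target file size for auto compaction, in bytes (here 64 MB)
    SET spark.databricks.delta.autoCompact.maxFileSize = 67108864;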

Optimized writes: maximize throughput

The goal is to reduce the number of files being written without sacrificing too much parallelism. Writing to a partitioned table requires shuffling the data according to the partitioning structure of the target table; the key point of optimized writes is that this shuffle is adaptive, sized to the data actually being written.

Two ways to get compacted files right after a write: write to Delta Lake and then immediately call OPTIMIZE, or enable auto compaction before writing to Delta Lake.
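A minimal sketch of both options (events and staging_events are hypothetical tables):

    -- option 1: write, then compact explicitly
    INSERT INTO events SELECT * FROM staging_events;
    OPTIMIZE events;

    -- option 2: enable auto compaction first, then write
    ALTER TABLE events SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');
    INSERT INTO events SELECT * FROM staging_events;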

Auto compaction generates smaller files (128 MB) than OPTIMIZE (1 GB).

If an Auto Optimize compaction fails because of a transaction conflict, the failed compaction is simply ignored rather than retried; any read stream on the table continues to operate normally.

By default, Delta Lake on Azure Databricks collects statistics on the first 32 columns defined in your table schema; these statistics are what the data-skipping algorithms use.
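That column count can be changed with the delta.dataSkippingNumIndexedCols table property; a minimal sketch (the events table and the value 40 are illustrative):

    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');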


