Optimize performance with Delta Lake
Delta Lake on Azure Databricks supports two layout algorithms:
bin-packing and Z-Ordering.
bin-packing: coalesces small files into larger ones, which improves read performance. Run OPTIMIZE to trigger compaction.
Specify a partition predicate if you only want to optimize specific partitions.
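A minimal sketch of both forms of compaction, assuming a hypothetical Delta table named events partitioned by a date column:

```sql
-- Compact small files across the whole table
OPTIMIZE events;

-- Restrict compaction to a subset of partitions with a WHERE clause
OPTIMIZE events WHERE date >= '2021-01-01';
```

The partition predicate keeps the compaction job small when only recent partitions have accumulated small files.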
Z-Ordering: co-locates related information in the same set of files. Delta Lake then automatically uses data-skipping algorithms to reduce the amount of data that needs to be read.
When to use Z-Ordering
Use it on columns that appear in query predicates and have high cardinality.
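For example, assuming the same hypothetical events table and a high-cardinality eventType column that queries frequently filter on:

```sql
-- Co-locate rows with similar eventType values in the same files,
-- so data skipping can prune files for predicates on eventType
OPTIMIZE events
WHERE date >= '2021-01-01'
ZORDER BY (eventType);
```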
Delta table readers use snapshot isolation, so queries are not interrupted when OPTIMIZE removes unneeded files from the transaction log. OPTIMIZE returns file statistics (min, max, total files) and Z-Ordering statistics (number of batches, partitions optimized) for the files removed.
Auto Optimize: can also compact files during individual writes to a Delta table.
It consists of two features: optimized writes and auto compaction.
Optimized writes: dynamically optimizes Spark partition sizes based on the actual data, aiming to write out 128 MB files for each table partition (per partition, not per table).
Auto compaction: compacts small files after a write completes.
In the CREATE TABLE command, set the table properties:
delta.autoOptimize.optimizeWrite = true
delta.autoOptimize.autoCompact = true
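A sketch of setting both properties, assuming a hypothetical student table:

```sql
-- Enable optimized writes and auto compaction at table creation
CREATE TABLE student (id INT, name STRING, age INT)
USING DELTA
TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);

-- For an existing table, the same properties can be applied with ALTER TABLE
ALTER TABLE student
SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);
```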
For tables larger than 10 TB, keep OPTIMIZE running on a schedule to consolidate files and reduce the metadata of your Delta table.
Auto Optimize does not support Z-Ordering, so schedule OPTIMIZE ... ZORDER BY jobs to run periodically.
Auto Optimize triggers compaction only when a directory contains more than 50 small files.
For custom behaviour, set spark.databricks.delta.autoCompact.minNumFiles.
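A sketch of lowering the threshold for the current session (the value 10 is an illustrative choice, not a recommendation):

```sql
-- Trigger auto compaction once a directory accumulates 10 small files
-- instead of the default 50
SET spark.databricks.delta.autoCompact.minNumFiles = 10;
```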
delta.autoOptimize.autoCompact accepts true, false, legacy, and auto (recommended).
When set to auto, auto compaction uses a default target file size of 32 MB.
When set to legacy or true, it uses a default target file size of 128 MB.
Optimizing writes: maximize throughput by reducing the number of files being written, without sacrificing too much parallelism.
Writes require shuffling data according to the partitioning structure of the target table; optimized writes make this an adaptive shuffle.
Either write to Delta Lake and then immediately call OPTIMIZE, or enable auto compaction and then write to Delta Lake.
Auto compaction generates smaller files (~128 MB) while OPTIMIZE generates larger ones (~1 GB).
If an Auto Optimize job fails because of a transaction conflict, the failure is ignored, and the stream continues to operate normally.
By default, Delta Lake on Azure Databricks collects statistics on the first 32 columns defined in your table schema.
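This count can be tuned with the delta.dataSkippingNumIndexedCols table property; a sketch, again assuming a hypothetical events table:

```sql
-- Collect statistics on only the first 5 columns instead of 32,
-- reducing write overhead for wide tables
ALTER TABLE events
SET TBLPROPERTIES (delta.dataSkippingNumIndexedCols = 5);
```

Reducing this helps write performance on wide tables, at the cost of data skipping on the later columns.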