Analyzing Databricks performance using Ganglia

To understand how the machines inside a Databricks cluster are behaving, we can look at the Ganglia dashboard. Ganglia is a scalable, distributed monitoring system originally built for high-performance computing; it surfaces telemetry such as CPU usage, memory usage, network I/O, and disk activity.

By analyzing the Ganglia dashboard, we can determine whether our cluster is under-provisioned, over-provisioned, or provisioned appropriately. It is largely complementary to the Databricks UI and works at both the cluster level and the node level. Node-level metrics such as CPU consumption, memory consumption, disk usage, and network I/O are all factors that can affect the stability and performance of a job. Analyzing them helps us decide whether the nodes are configured appropriately, whether to institute manual scaling or auto-scaling, and whether to change instance types.
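Under the hood, Ganglia's gmond daemon reports this telemetry as XML (conventionally served on TCP port 8649). Whether that port is reachable on a Databricks driver node will depend on the deployment, so as a hedged illustration the sketch below parses a hand-made sample of the gmond XML format rather than polling a live daemon; the host names and metric values are invented for the example.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-made sample mimicking gmond's XML telemetry format.
# A live deployment would fetch this from the daemon (typically port 8649).
SAMPLE_GMOND_XML = """\
<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="databricks-cluster" LOCALTIME="0" OWNER="" LATLONG="" URL="">
    <HOST NAME="worker-0" IP="10.0.0.1" REPORTED="0">
      <METRIC NAME="cpu_idle" VAL="12.5" TYPE="float" UNITS="%"/>
      <METRIC NAME="mem_free" VAL="524288" TYPE="float" UNITS="KB"/>
    </HOST>
    <HOST NAME="worker-1" IP="10.0.0.2" REPORTED="0">
      <METRIC NAME="cpu_idle" VAL="78.0" TYPE="float" UNITS="%"/>
      <METRIC NAME="mem_free" VAL="1048576" TYPE="float" UNITS="KB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""

def metrics_per_host(xml_text, metric_name):
    """Return {host_name: value} for one named metric across all nodes."""
    root = ET.fromstring(xml_text)
    out = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                out[host.get("NAME")] = float(metric.get("VAL"))
    return out

cpu_idle = metrics_per_host(SAMPLE_GMOND_XML, "cpu_idle")
print(cpu_idle)  # {'worker-0': 12.5, 'worker-1': 78.0}
```

A node sitting at 12.5% CPU idle while its peer idles at 78% is exactly the kind of skew that shows up as a hot spot on the dashboard, discussed next.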

NOTE: Ganglia does not offer job-specific insights, nor does it work with pipelines. Output options are also limited; there is little choice beyond taking a screenshot to capture a JPEG or PNG image of the current status.

We can find Ganglia under the cluster's Metrics tab in Databricks, as shown below.


Figure 1: Ganglia metrics and their interpretation

The diagram above shows an example of a balanced server load distribution, while the one below shows an unbalanced distribution. Watch for red squares: they indicate hot spots where the load is highest.


Figure 2: Example of an unbalanced server load distribution

Another thing to watch for is memory swapping, as it indicates pressure on RAM. The diagram below highlights an example.


Figure 3: Indicator of RAM pressure

Best practices for using Ganglia

A.   Start with the Cluster Memory Last Hour and Cluster CPU Last Hour dashboards. Look for usage patterns and idle periods.

B.   Consult the Server Load Distribution map to gauge how evenly the workload is spread across the nodes. The absence of red squares indicates a well-balanced load.

C.   Be on the lookout for memory swapping. It can be detected in the Cluster Memory Last Hour dashboard as a small purple line above the red line, as shown in Figure 3. This indicates memory pressure.
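Checks B and C lend themselves to simple arithmetic once per-node metrics are in hand. The sketch below is illustrative only: the tolerance threshold and the sample load values are assumptions, not Databricks-prescribed figures.

```python
def is_load_balanced(load_per_host, tolerance=0.25):
    """Flag imbalance if any node deviates from the mean load by more
    than `tolerance` (expressed as a fraction of the mean).
    The 0.25 default is an illustrative assumption."""
    loads = list(load_per_host.values())
    mean = sum(loads) / len(loads)
    if mean == 0:
        return True  # all nodes idle: trivially balanced
    return all(abs(x - mean) / mean <= tolerance for x in loads)

def swap_used_fraction(swap_total_kb, swap_free_kb):
    """Fraction of swap space in use; anything noticeably above zero
    suggests RAM pressure (check C)."""
    return (swap_total_kb - swap_free_kb) / swap_total_kb

# Hypothetical per-node load averages:
balanced = is_load_balanced({"worker-0": 3.1, "worker-1": 2.9, "worker-2": 3.0})
hot_spot = is_load_balanced({"worker-0": 9.5, "worker-1": 1.0, "worker-2": 1.2})
print(balanced, hot_spot)  # True False

# 8 GB of swap, none in use -> no memory pressure:
print(swap_used_fraction(8388608, 8388608))  # 0.0
```

The second call flags worker-0 as a hot spot: its load of 9.5 sits roughly 1.4x above the three-node mean, far beyond the 25% tolerance, which on the dashboard would show up as a red square.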

