Analyzing Databricks performance using Ganglia

To understand how the machines inside a Databricks cluster are behaving, we can look at the Ganglia dashboard. Ganglia is a scalable, distributed monitoring system originally built for high-performance computing; it surfaces telemetry such as CPU usage, memory usage, network I/O, and disk activity.

By analyzing the Ganglia dashboard, we can determine whether our cluster is under-provisioned, over-provisioned, or provisioned appropriately. It is largely complementary to the Databricks UI and works at both the cluster level and the node level. Node-level metrics such as CPU consumption, memory consumption, disk usage, and network I/O are all factors that can affect the stability and performance of a job. Analyzing them helps us decide whether the nodes are configured appropriately, whether to institute manual scaling or auto-scaling, and whether to change instance types.
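Under the hood, Ganglia's gmond daemon reports this telemetry as XML (conventionally served on TCP port 8649). Whether that port is reachable on a Databricks driver node will depend on the deployment, so as a hedged illustration the sketch below parses a hand-made sample of the gmond XML format rather than polling a live daemon; the host names and metric values are invented for the example.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-made sample mimicking gmond's XML telemetry format.
# A live deployment would fetch this from the daemon (typically port 8649).
SAMPLE_GMOND_XML = """\
<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="databricks-cluster" LOCALTIME="0" OWNER="" LATLONG="" URL="">
    <HOST NAME="worker-0" IP="10.0.0.1" REPORTED="0">
      <METRIC NAME="cpu_idle" VAL="12.5" TYPE="float" UNITS="%"/>
      <METRIC NAME="mem_free" VAL="524288" TYPE="float" UNITS="KB"/>
    </HOST>
    <HOST NAME="worker-1" IP="10.0.0.2" REPORTED="0">
      <METRIC NAME="cpu_idle" VAL="78.0" TYPE="float" UNITS="%"/>
      <METRIC NAME="mem_free" VAL="1048576" TYPE="float" UNITS="KB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""

def metrics_per_host(xml_text, metric_name):
    """Return {host_name: value} for one named metric across all nodes."""
    root = ET.fromstring(xml_text)
    out = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                out[host.get("NAME")] = float(metric.get("VAL"))
    return out

cpu_idle = metrics_per_host(SAMPLE_GMOND_XML, "cpu_idle")
print(cpu_idle)  # {'worker-0': 12.5, 'worker-1': 78.0}
```

A node sitting at 12.5% CPU idle while its peer idles at 78% is exactly the kind of skew that shows up as a hot spot on the dashboard, discussed next.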

NOTE: Ganglia does not offer job-specific insights, nor does it work with pipelines. Output options are also limited; there is little choice beyond taking a screenshot to capture a JPEG or PNG image of the current status.

We can find Ganglia under the cluster's Metrics tab in Databricks, as shown below.


Figure 1: Ganglia metrics and their interpretation

The diagram above shows an example of a balanced server load distribution, while the one below shows an unbalanced distribution. Watch for red squares: they indicate hot spots where the load is highest.


Figure 2: Example of an unbalanced server load distribution

Another thing to watch for is memory swapping, as it indicates pressure on RAM. The diagram below highlights an example.


Figure 3: Indicator of RAM pressure

Best practices for using Ganglia

A.   Start with the Cluster Memory Last Hour and Cluster CPU Last Hour dashboards. Look for usage patterns and idle periods.

B.   Consult the Server Load Distribution map to gauge how evenly the workload is spread across the nodes. The absence of red squares indicates a well-balanced load.

C.   Be on the lookout for memory swapping. It can be detected in the Cluster Memory Last Hour dashboard as a small purple line above the red line, as shown in Figure 3. This indicates memory pressure.
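Checks B and C lend themselves to simple arithmetic once per-node metrics are in hand. The sketch below is illustrative only: the tolerance threshold and the sample load values are assumptions, not Databricks-prescribed figures.

```python
def is_load_balanced(load_per_host, tolerance=0.25):
    """Flag imbalance if any node deviates from the mean load by more
    than `tolerance` (expressed as a fraction of the mean).
    The 0.25 default is an illustrative assumption."""
    loads = list(load_per_host.values())
    mean = sum(loads) / len(loads)
    if mean == 0:
        return True  # all nodes idle: trivially balanced
    return all(abs(x - mean) / mean <= tolerance for x in loads)

def swap_used_fraction(swap_total_kb, swap_free_kb):
    """Fraction of swap space in use; anything noticeably above zero
    suggests RAM pressure (check C)."""
    return (swap_total_kb - swap_free_kb) / swap_total_kb

# Hypothetical per-node load averages:
balanced = is_load_balanced({"worker-0": 3.1, "worker-1": 2.9, "worker-2": 3.0})
hot_spot = is_load_balanced({"worker-0": 9.5, "worker-1": 1.0, "worker-2": 1.2})
print(balanced, hot_spot)  # True False

# 8 GB of swap, none in use -> no memory pressure:
print(swap_used_fraction(8388608, 8388608))  # 0.0
```

The second call flags worker-0 as a hot spot: its load of 9.5 sits roughly 1.4x above the three-node mean, far beyond the 25% tolerance, which on the dashboard would show up as a red square.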

