Graph-Powered Observability Data Analysis in Databricks with Unity Catalog Credential Vending
Observability captures the complex interactions within modern distributed systems. PuppyGraph + Databricks make efficient analytics on that data possible.


This recent lightning talk (https://www.databricks.com/dataaisummit/session/graph-powered-observability-data-analysis-databricks-credential-vending) describes an architecture that complements an existing Datadog/Splunk system. Observability traces and metrics are dual-shipped to an open data lake such as Databricks Uniform, which serves as a cost-efficient storage layer. PuppyGraph (and other query engines) then analyze the massive observability data, with Credential Vending and RBAC provided by Unity Catalog.

Storage

Volume Metrics (per day)

  • 18~26B events
  • 1600~2900 GiB
  • 1000~1700 Parquet files
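To get a feel for these numbers, the ranges above can be turned into rough per-event and per-file sizes. This is only back-of-the-envelope arithmetic: pairing the low ends together and the high ends together is an assumption, and the helper name per_day_ratios is made up for illustration.

```python
GIB = 2**30  # bytes in one GiB


def per_day_ratios(events: float, gib: float, files: int) -> dict:
    """Derive average stored bytes per event and average GiB per Parquet file."""
    return {
        "bytes_per_event": gib * GIB / events,
        "gib_per_file": gib / files,
    }


# Assumption: low ends of each range belong to the same day, likewise high ends.
low = per_day_ratios(events=18e9, gib=1600, files=1000)
high = per_day_ratios(events=26e9, gib=2900, files=1700)

for name, r in (("low", low), ("high", high)):
    print(f"{name}: {r['bytes_per_event']:.1f} B/event, {r['gib_per_file']:.2f} GiB/file")
```

Under that pairing, an event costs on the order of 100 bytes of storage and a Parquet file averages well over 1 GiB.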

Optimized Layout

  • Partitioned by Hour or Day
  • Sub-partitioned by the prefix of trace_id
  • Z-Ordering by (trace_id, span_id)
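The layout above could be derived per event at write time roughly as follows. This is a minimal sketch, not the actual pipeline: the _time_part string format, the trace_prefix column name, and the two-character prefix length are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def partition_keys(event_time: datetime, trace_id: str,
                   prefix_len: int = 2, hourly: bool = True) -> dict:
    """Compute partition values for one event.

    Assumptions: event_time is timezone-aware; _time_part is a UTC string;
    'trace_prefix' is a hypothetical name for the sub-partition column.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    ts = event_time.astimezone(timezone.utc)
    fmt = "%Y-%m-%d-%H" if hourly else "%Y-%m-%d"
    return {
        "_time_part": ts.strftime(fmt),
        "trace_prefix": trace_id[:prefix_len].lower(),
    }


keys = partition_keys(datetime(2025, 6, 11, 5, 37, tzinfo=timezone.utc), "AB12cd34")
print(keys)
```

Because both values depend only on fields already present in the event, the writer can route each event to its partition without any later reclustering pass.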

Partitions are determined at write time, so they avoid the overhead of deferred Liquid clustering (or Snowflake auto-clustering). Because the data volume is massive, we have to make sure partition pruning works as soon as events are written to the data lake. Z-ordering can then be controlled precisely, one time window at a time, with "OPTIMIZE datadog_traces WHERE _time_part>=? and _time_part<? ZORDER BY (trace_id, span_id)"
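As an illustration of running that OPTIMIZE over a bounded window, a small helper can fill in the two placeholders. This is a sketch under assumptions: the function name build_optimize_sql is hypothetical, and the _time_part values are assumed to be strings made of digits and dashes.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")  # table name, optionally qualified
_PART = re.compile(r"^[0-9-]+$")                   # assumed _time_part value format


def build_optimize_sql(table: str, start_part: str, end_part: str) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for a half-open partition window."""
    if not _IDENT.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    for value in (start_part, end_part):
        if not _PART.match(value):
            raise ValueError(f"unexpected partition value: {value!r}")
    return (
        f"OPTIMIZE {table} "
        f"WHERE _time_part >= '{start_part}' AND _time_part < '{end_part}' "
        f"ZORDER BY (trace_id, span_id)"
    )


sql = build_optimize_sql("datadog_traces", "2025-06-11-00", "2025-06-11-06")
print(sql)
```

The generated string would then be submitted through whatever SQL interface the job already uses; restricting the WHERE clause to partition values keeps each run limited to recently written files.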

Service Map/Graph

Service Map visualizes data collected by Datadog APM and RUM

Service Map is a powerful graph analysis feature provided by Datadog. With PuppyGraph, we gain deeper control over augmenting the service map topology with metrics such as failures, throughput, the number of API calls within the latency SLA, and data lineage.

Such a topology can be pre-computed on a daily basis or queried on the fly over the most recent 1~3 hours of live data.
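To make the idea concrete, here is a minimal in-memory sketch of deriving service-to-service edges from spans and attaching failure and SLA counts to each edge. It is not PuppyGraph's API or the production query: the span field names (service, parent_id, duration_ms, error) and the 200 ms SLA are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeStats:
    calls: int = 0        # throughput: cross-service calls on this edge
    failures: int = 0     # calls whose child span is marked as an error
    within_sla: int = 0   # calls whose child span met the latency SLA


def build_service_map(spans: list, sla_ms: float) -> dict:
    """Aggregate parent->child spans into (caller, callee) service edges."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = defaultdict(EdgeStats)
    for s in spans:
        parent = by_id.get((s["trace_id"], s.get("parent_id")))
        # Skip root spans and calls that stay inside one service.
        if parent is None or parent["service"] == s["service"]:
            continue
        e = edges[(parent["service"], s["service"])]
        e.calls += 1
        if s.get("error"):
            e.failures += 1
        if s["duration_ms"] <= sla_ms:
            e.within_sla += 1
    return dict(edges)


spans = [
    {"trace_id": "t1", "span_id": "1", "parent_id": None, "service": "web", "duration_ms": 400, "error": False},
    {"trace_id": "t1", "span_id": "2", "parent_id": "1", "service": "api", "duration_ms": 80, "error": False},
    {"trace_id": "t1", "span_id": "3", "parent_id": "2", "service": "db", "duration_ms": 300, "error": True},
    {"trace_id": "t2", "span_id": "1", "parent_id": None, "service": "web", "duration_ms": 260, "error": False},
    {"trace_id": "t2", "span_id": "2", "parent_id": "1", "service": "api", "duration_ms": 250, "error": False},
]
service_map = build_service_map(spans, sla_ms=200)
for edge, stats in service_map.items():
    print(edge, stats)
```

The same aggregation can run once per day over a full partition or over only the latest few hourly partitions, which matches the pre-computed and live modes described above.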

Incident Root Cause Analysis

Eric Weiss gave a very comprehensive talk (Lost in Tracing? How Coinbase Connects the Dots for Faster RCA, 5:37~6:15) at DASH on 6/11/2025. His talk covers more details about the use case, data flow, and challenges amplified by massive telemetry data points.

This article is a great collaboration among Danfeng Xu (PuppyGraph), Xinyu Liu (who couldn't travel to the Databricks Summit this time, so I sat in for him), Yisheng Liang, Eric Weiss, Chris Mueller, and Clinton Begin.

Thanks for all the support from Databricks: Michelle Leon, Toby Messinger, and Janelle Davies.


[PDF deck for this Databricks talk]

