Accurate GPU Metrics with nv-monitor: NVIDIA NVML Spec Compliance

I needed to monitor GPU metrics across an AI cluster. Should be simple, right? Three tools, two plugins, a container runtime, and half a day later -- the numbers were still wrong. Unified memory misreported. HugePages ignored. ARM topology invisible.

So I built nv-monitor. One C file. One compile. One binary under 80KB. I drop it on any node -- ARM or x86 -- and I immediately get accurate CPU, memory, and GPU metrics. No Python, no containers, no dependencies. I built it against NVIDIA's actual NVML spec and tested on real DGX Spark hardware. Most monitoring tools get these numbers wrong. Mine doesn't.

I needed cluster-wide visibility too, so every binary includes a built-in Prometheus exporter. Point Grafana at your fleet and you're done. Minutes, not days.

It even ships with a synthetic load generator -- I can fire up realistic CPU and GPU patterns across every node to prove the whole pipeline works end-to-end before real workloads hit.

I've open-sourced it (MIT): https://lnkd.in/ebXJ9r4G

#AI #NVIDIA #GPU #MachineLearning #DGXSpark #MLOps #OpenSource #DevTools #Monitoring #Prometheus #Grafana #DeepLearning #AIInfrastructure
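
The built-in exporter serves metrics in Prometheus's plain-text exposition format. A minimal C sketch of what formatting one gauge looks like -- the metric name and label here are illustrative, not nv-monitor's actual names:

```c
#include <stdio.h>

/* Hypothetical sketch: format one GPU gauge line in Prometheus
 * exposition format, the text protocol a built-in exporter serves
 * over HTTP. Metric and label names are illustrative only.
 * Returns the number of bytes written, or a negative value on error. */
static int format_gpu_gauge(char *buf, size_t cap,
                            int gpu_index, double utilization_pct)
{
    return snprintf(buf, cap,
                    "# TYPE gpu_utilization_percent gauge\n"
                    "gpu_utilization_percent{gpu=\"%d\"} %.1f\n",
                    gpu_index, utilization_pct);
}
```

Grafana never needs to know about the tool itself: anything that emits this text format over HTTP can be scraped by Prometheus and dashboarded.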

ARM-based AI hardware like the Dell Pro Max GB10 / DGX Spark is basically a monitoring blind spot right now — there’s no solid off-the-shelf answer for this class of machine. Respect for building against real hardware.

Two things I’d love to know: are you breaking out NVFP4 vs FP8 vs FP16 compute utilisation separately? Standard NVML nvmlDeviceGetUtilizationRates doesn’t do that, and on Blackwell it’s the number that actually tells you if the hardware is being used right.

Second — cross-node interconnect bandwidth over the QSFP fabric? Per-node GPU stats are useful, but in a multi-node inference cluster the bottleneck is usually the interconnect, not the GPU itself. This is impressive.
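
On the interconnect question above: fabric NICs typically expose only cumulative byte counters (e.g. under /sys/class/infiniband/.../counters), so bandwidth has to be derived from two samples. A hedged C sketch of that derivation -- the function name is illustrative, not part of nv-monitor:

```c
#include <stdint.h>

/* Illustrative helper: derive link bandwidth in Gbit/s from two
 * samples of a cumulative byte counter taken interval_s seconds
 * apart. Counters can reset (driver reload, wrap), so a decrease
 * is treated as "no data" rather than a huge bogus rate. */
static double link_gbit_per_s(uint64_t bytes_prev, uint64_t bytes_now,
                              double interval_s)
{
    if (interval_s <= 0.0 || bytes_now < bytes_prev)
        return 0.0; /* counter reset or bad interval: report nothing */
    return (double)(bytes_now - bytes_prev) * 8.0 / (interval_s * 1e9);
}
```

The same delta-over-interval pattern applies to any cumulative counter a monitor samples, whether it comes from sysfs, NVML, or a switch.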

I coded a GPU/CPU monitor for my setup in 3 minutes.

I was thinking of getting a DGX but the memory bandwidth seems a bit low. Are you happy with its performance?

Wow, awesome!! Does it have configurable monitor intervals? For example, I want it to check every 2 secs.
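
A configurable interval usually arrives as a command-line flag. A small C sketch of parsing a hypothetical "--interval=N" option -- nv-monitor's real CLI may differ; this only shows the shape of such an option:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative option parser: extract a polling period in seconds
 * from a hypothetical "--interval=N" argument. Anything that is not
 * a positive integer falls back to the supplied default. */
static long parse_interval_s(const char *arg, long fallback)
{
    const char *prefix = "--interval=";
    size_t plen = strlen(prefix);
    char *end;
    long v;

    if (strncmp(arg, prefix, plen) != 0)
        return fallback;
    v = strtol(arg + plen, &end, 10);
    if (*end != '\0' || v <= 0)
        return fallback; /* reject junk and non-positive intervals */
    return v;
}
```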

Any plans to add RDMA metrics?

Very impressive, however you can also use Grafana.

This tool looks really useful, Paul Gresham. Would it work on edge computers like the Orin Nano?

A question: how is this different from an application like btop (on Linux)?

This is extremely useful. I just recently configured a 2-node active-active Kubernetes cluster with vLLM+Ray over RoCE (using the DGX Sparks), and this utility couldn't have come at a better time. Thank you for this project!

Honestly, if AI helps bring us back to low-level systems programming languages that are more efficient with resources, I'm game. Cool looking project!
