Accurate GPU Metrics with nv-monitor: NVIDIA NVML Spec Compliance

I needed to monitor GPU metrics across an AI cluster. Should be simple, right? Three tools, two plugins, a container runtime, and half a day later -- the numbers were still wrong. Unified memory misreported. HugePages ignored. ARM topology invisible.

So I built nv-monitor. One C file. One compile. One binary under 80KB. I drop it on any node -- ARM or x86 -- and I immediately get accurate CPU, memory, and GPU metrics. No Python, no containers, no dependencies. I built it against NVIDIA's actual NVML spec and tested on real DGX Spark hardware. Most monitoring tools get these numbers wrong. Mine doesn't.

I needed cluster-wide visibility too, so every binary includes a built-in Prometheus exporter. Point Grafana at your fleet and you're done. Minutes, not days.

It even ships with a synthetic load generator -- I can fire up realistic CPU and GPU patterns across every node to prove the whole pipeline works end-to-end before real workloads hit.

I've open-sourced it (MIT): https://lnkd.in/ebXJ9r4G

#AI #NVIDIA #GPU #MachineLearning #DGXSpark #MLOps #OpenSource #DevTools #Monitoring #Prometheus #Grafana #DeepLearning #AIInfrastructure
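
The built-in exporter serves metrics in Prometheus's plain-text exposition format. A minimal C sketch of what formatting one gauge looks like -- the metric name and label here are illustrative, not nv-monitor's actual names:

```c
#include <stdio.h>

/* Hypothetical sketch: format one GPU gauge line in Prometheus
 * exposition format, the text protocol a built-in exporter serves
 * over HTTP. Metric and label names are illustrative only.
 * Returns the number of bytes written, or a negative value on error. */
static int format_gpu_gauge(char *buf, size_t cap,
                            int gpu_index, double utilization_pct)
{
    return snprintf(buf, cap,
                    "# TYPE gpu_utilization_percent gauge\n"
                    "gpu_utilization_percent{gpu=\"%d\"} %.1f\n",
                    gpu_index, utilization_pct);
}
```

Grafana never needs to know about the tool itself: anything that emits this text format over HTTP can be scraped by Prometheus and dashboarded.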

ARM-based AI hardware like the Dell Pro Max GB10 / DGX Spark is basically a monitoring blind spot right now — there’s no solid off-the-shelf answer for this class of machine. Respect for building against real hardware.

Two things I’d love to know: are you breaking out NVFP4 vs FP8 vs FP16 compute utilisation separately? Standard NVML nvmlDeviceGetUtilizationRates doesn’t do that, and on Blackwell it’s the number that actually tells you if the hardware is being used right.

Second — cross-node interconnect bandwidth over the QSFP fabric? Per-node GPU stats are useful, but in a multi-node inference cluster the bottleneck is usually the interconnect, not the GPU itself. This is impressive.
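
On the interconnect question above: fabric NICs typically expose only cumulative byte counters (e.g. under /sys/class/infiniband/.../counters), so bandwidth has to be derived from two samples. A hedged C sketch of that derivation -- the function name is illustrative, not part of nv-monitor:

```c
#include <stdint.h>

/* Illustrative helper: derive link bandwidth in Gbit/s from two
 * samples of a cumulative byte counter taken interval_s seconds
 * apart. Counters can reset (driver reload, wrap), so a decrease
 * is treated as "no data" rather than a huge bogus rate. */
static double link_gbit_per_s(uint64_t bytes_prev, uint64_t bytes_now,
                              double interval_s)
{
    if (interval_s <= 0.0 || bytes_now < bytes_prev)
        return 0.0; /* counter reset or bad interval: report nothing */
    return (double)(bytes_now - bytes_prev) * 8.0 / (interval_s * 1e9);
}
```

The same delta-over-interval pattern applies to any cumulative counter a monitor samples, whether it comes from sysfs, NVML, or a switch.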

I coded a GPU/CPU monitor for my setup in 3 minutes.

I was thinking of getting a DGX but the memory bandwidth seems a bit low. Are you happy with its performance?

Wow, awesome!! Does it have configurable monitor intervals? For example, I want it to check every 2 secs.
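
A configurable interval usually arrives as a command-line flag. A small C sketch of parsing a hypothetical "--interval=N" option -- nv-monitor's real CLI may differ; this only shows the shape of such an option:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative option parser: extract a polling period in seconds
 * from a hypothetical "--interval=N" argument. Anything that is not
 * a positive integer falls back to the supplied default. */
static long parse_interval_s(const char *arg, long fallback)
{
    const char *prefix = "--interval=";
    size_t plen = strlen(prefix);
    char *end;
    long v;

    if (strncmp(arg, prefix, plen) != 0)
        return fallback;
    v = strtol(arg + plen, &end, 10);
    if (*end != '\0' || v <= 0)
        return fallback; /* reject junk and non-positive intervals */
    return v;
}
```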

Any plans to add RDMA metrics?

Very impressive, however you can also use Grafana.

This tool looks really useful, Paul Gresham. Would it work on edge computers like the Orin Nano?

A question: how is this different from an application like btop (on Linux)?

This is extremely useful. I just recently configured a 2-node active-active Kubernetes cluster with vLLM+Ray over RoCE (using the DGX Sparks), and this utility couldn't have come at a better time. Thank you for this project!

Honestly, if AI helps bring us back to low-level systems programming languages that are more efficient with resources, I'm game. Cool looking project!
