LLM Observability Stack v2.0: Why Infrastructure Testing Changed Everything

From Reactive Firefighting to Proactive Validation

The Problem We All Face

You've deployed your LLM inference server. The YAML looks perfect. Kubernetes says the pods are running. But then...

  • 3 AM alert: GPU out of memory
  • Customer complaint: Inference latency spiked to 30 seconds
  • Monday morning: "The model isn't responding."

Sound familiar?

After deploying dozens of LLM workloads on GPU clusters, I realized something critical: we were always reacting to failures instead of preventing them.

That's why I built LLM Observability Stack v2.0, and why its two key enhancements changed everything.


What's New in Version 2.0?

Enhancement #1: Infrastructure Testing with Pytest

The game-changer: Validate your infrastructure BEFORE it breaks.

In v1.0, we had beautiful Grafana dashboards showing real-time GPU metrics. But dashboards only tell you something is wrong after it happens.

v2.0 introduces a complete pytest infrastructure testing framework that runs validation checks before deployment:

============================================================
  LLM OBSERVABILITY STACK v2.0 - INFRASTRUCTURE TESTS
============================================================

--- Cluster Health Tests ---
PASSED:  vLLM Health Endpoint          - HTTP 200 OK
PASSED:  vLLM Models Endpoint          - Models available
PASSED:  vLLM Inference Completion     - Inference working

--- GPU Tests ---
PASSED:  NVIDIA-SMI Available          - nvidia-smi accessible
PASSED:  CUDA Available                - torch.cuda.is_available()
PASSED:  GPU Memory > 20GB             - NVIDIA A10 24GB

--- Network Tests ---
PASSED:  Kubernetes DNS Resolution     - kubernetes.default.svc resolved
PASSED:  Elasticsearch Connectivity    - HTTP 200 on port 9200

============================================================
  RESULTS: 11 passed, 1 failed, 0 skipped
============================================================
        

Why this matters: a failed check surfaces at deploy time, in seconds, instead of hours into a production incident.

The test suite validates:

  • GPU Health: CUDA availability, memory capacity, temperature
  • Network: DNS resolution, service connectivity
  • Storage: Write permissions, model cache paths
  • Endpoints: Health checks, API responses, inference capability
  • Kubernetes: Pod scheduling, resource allocation
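To make this concrete, here is a minimal sketch of what the GPU checks can look like. It mirrors the sample run above (nvidia-smi on PATH, CUDA visible to PyTorch, memory above a 20 GB threshold); the actual test names and thresholds in the repo may differ.

import shutil
import subprocess

import pytest
import torch

MIN_GPU_MEMORY_GB = 20  # matches the "GPU Memory > 20GB" check in the sample run


def test_nvidia_smi_available():
    """nvidia-smi must be on PATH and exit cleanly."""
    assert shutil.which("nvidia-smi"), "nvidia-smi not found on PATH"
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    assert result.returncode == 0


def test_cuda_available():
    """PyTorch must see at least one CUDA device."""
    assert torch.cuda.is_available(), "CUDA not available to PyTorch"


def test_gpu_memory_capacity():
    """Every visible GPU must have enough memory for the model."""
    if not torch.cuda.is_available():
        pytest.skip("no CUDA device")
    for i in range(torch.cuda.device_count()):
        total_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
        assert total_gb >= MIN_GPU_MEMORY_GB, (
            f"GPU {i} has {total_gb:.1f} GB, need {MIN_GPU_MEMORY_GB} GB"
        )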


Enhancement #2: ELK Stack for Centralized Logging

Metrics tell you WHAT happened. Logs tell you WHY.

Grafana and Prometheus are excellent for metrics. But when you're debugging why inference failed for a specific request, you need logs.

v2.0 adds a complete ELK stack:

Filebeat (Collection) --> Logstash (Processing) --> Elasticsearch (Storage) --> Kibana (Visualization)
        

The results speak for themselves:

Metric                   Value
Log Documents Indexed    248,805+
Unique Pods Monitored    50
Containers Tracked       37
Namespaces Covered       5
Query Response Time      < 100 ms

Real debugging scenarios solved:

  1. "Why did inference fail?"
  2. "What happened to the GPU?"
  3. "Show me all errors in the last hour"
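Scenario 3, for instance, reduces to one Elasticsearch query. A minimal sketch, assuming Filebeat's default filebeat-* indices; the in-cluster hostname below is hypothetical, and matching "error" in the message field is a simple heuristic:

import requests

# Hypothetical in-cluster service URL; port 9200 matches the
# Elasticsearch connectivity test above.
ES_URL = "http://elasticsearch.llm-observability.svc:9200"

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"message": "error"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 50,
}

resp = requests.post(f"{ES_URL}/filebeat-*/_search", json=query, timeout=10)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    # kubernetes.pod.name is the field Filebeat's Kubernetes metadata adds
    print(src.get("kubernetes", {}).get("pod", {}).get("name"), src.get("message"))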


The Architecture

+-------------------------------------------------------------------+
|                        KUBERNETES CLUSTER                         |
|                                                                   |
|  +-------------------------------------------------------------+  |
|  |                 llm-observability namespace                 |  |
|  |                                                             |  |
|  |  [Filebeat]      --> [Logstash]        (Log Pipeline)       |  |
|  |  [Elasticsearch] --> [Kibana]          (Logs & Dashboards)  |  |
|  |  [Prometheus]    --> [Grafana]         (Metrics & Alerts)   |  |
|  +-------------------------------------------------------------+  |
|                                                                   |
|  +-------------------------------------------------------------+  |
|  |  [vLLM + Mistral-7B]    [DCGM Exporter]    [GPU Operator]   |  |
|  |        GPU Node 1       GPU Node 2                          |  |
|  |    [A10 24GB x 2]     [A10 24GB x 2]                        |  |
|  +-------------------------------------------------------------+  |
+-------------------------------------------------------------------+
        

Why Both? The Observability Triangle

                    PREVENTION
                        /\
                       /  \
                      /    \
                     /Pytest\
                    /  Tests \
                   /          \
                  /____________\
                 /              \
                /                \
          METRICS              LOGS
         (Prometheus)        (ELK Stack)

         "What is           "Why did it
          happening?"        happen?"
        

  • Pytest = prevent issues before deployment
  • Prometheus/Grafana = monitor what's happening now
  • ELK Stack = understand why things happened

Together, they create a complete observability story.


Real-World Impact

Before v2.0:

  • Deployed vLLM on OKE
  • Looked good in kubectl
  • 2 hours later: OOM errors
  • Spent 4 hours debugging

After v2.0:

  • Ran pytest infrastructure tests
  • Test failed: "GPU Memory check failed - only 16GB available"
  • Fixed node pool configuration
  • Deployed with confidence
  • Zero incidents

Time saved: ~4 hours per deployment


It's a Framework, Not Just a Stack

Component            Customise For
Pytest Tests         Your specific validation needs
Kibana Dashboards    Your metrics and KPIs
Logstash Pipelines   Your log formats
Alert Rules          Your SLAs

Add tests for your use case:

import requests

def test_my_model_loaded():
    """Verify that a specific model is loaded and served."""
    response = requests.get('http://vllm:8000/v1/models', timeout=10)
    response.raise_for_status()
    models = response.json()['data']
    assert any(m['id'] == 'my-custom-model' for m in models)
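Assuming the suite follows a standard pytest layout (check the repo for the exact paths), you can run just your new check with pytest's keyword filter:

pytest -v -k my_model_loaded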

Technical Stack

This isn't a one-size-fits-all solution. It's a template you can customise:

  • Kubernetes: Oracle Kubernetes Engine (OKE)
  • Inference: vLLM serving Mistral-7B-Instruct-v0.3
  • GPU: NVIDIA GPU Operator + DCGM Exporter on A10 nodes
  • Metrics: Prometheus + Grafana
  • Logs: Filebeat + Logstash + Elasticsearch + Kibana
  • Testing: pytest

Key Takeaways

  1. Metrics alone aren't enough - You need logs to debug issues
  2. Dashboards are reactive - Tests are proactive
  3. Infrastructure testing catches issues before production - Not after
  4. Centralized logging is essential - When you're debugging across 50 pods
  5. Build a framework, not a one-off solution - Make it extensible


Get Started

The complete stack is open source and available on GitHub:

GitHub Repository: https://github.com/deepaksatna/LLM-Observability-Stack-v2.0

What's Included:

  • 16 Kubernetes deployment manifests
  • 4 Kibana dashboards (with screenshots)
  • 9 Grafana dashboards
  • 5 pytest test modules
  • Complete documentation
  • Deployment scripts

Test Environment:

  • Oracle Kubernetes Engine (OKE)
  • 4x NVIDIA A10 GPUs (96GB total VRAM)
  • Mistral-7B-Instruct-v0.3
  • 248,805+ log documents indexed


What's Next?

In future versions, I'm exploring:

  • Automated remediation - Self-healing based on test failures
  • Cost tracking - GPU utilization to cost mapping
  • Multi-cluster support - Federated observability
  • AI-powered log analysis - Using LLMs to analyze their own logs


Let's Connect

If you're running LLM workloads on Kubernetes and struggling with observability, I'd love to hear about your challenges.

What's the hardest part of monitoring your LLM infrastructure?

Drop a comment below or reach out directly.


