LLM Observability Stack v2.0: Why Infrastructure Testing Changed Everything

From Reactive Firefighting to Proactive Validation

The Problem We All Face

You've deployed your LLM inference server. The YAML looks perfect. Kubernetes says the pods are running. But then...

  • 3 AM alert: GPU out of memory
  • Customer complaint: Inference latency spiked to 30 seconds
  • Monday morning: "The model isn't responding."

Sound familiar?

After deploying dozens of LLM workloads on GPU clusters, I realized something critical: we were always reacting to failures instead of preventing them.

That's why I built LLM Observability Stack v2.0, and why its two key enhancements changed everything.


What's New in Version 2.0?

Enhancement #1: Infrastructure Testing with Pytest

The game-changer: Validate your infrastructure BEFORE it breaks.

In v1.0, we had beautiful Grafana dashboards showing real-time GPU metrics. But dashboards only tell you something is wrong after it happens.

v2.0 introduces a complete pytest infrastructure testing framework that runs validation checks before deployment:

============================================================
  LLM OBSERVABILITY STACK v2.0 - INFRASTRUCTURE TESTS
============================================================

--- Cluster Health Tests ---
PASSED:  vLLM Health Endpoint          - HTTP 200 OK
PASSED:  vLLM Models Endpoint          - Models available
PASSED:  vLLM Inference Completion     - Inference working

--- GPU Tests ---
PASSED:  NVIDIA-SMI Available          - nvidia-smi accessible
PASSED:  CUDA Available                - torch.cuda.is_available()
PASSED:  GPU Memory > 20GB             - NVIDIA A10 24GB

--- Network Tests ---
PASSED:  Kubernetes DNS Resolution     - kubernetes.default.svc resolved
PASSED:  Elasticsearch Connectivity    - HTTP 200 on port 9200

============================================================
  RESULTS: 11 passed, 1 failed, 0 skipped
============================================================
        

Why this matters: a failed check surfaces at deploy time, in seconds, instead of hours into a production incident.

The test suite validates:

  • GPU Health: CUDA availability, memory capacity, temperature
  • Network: DNS resolution, service connectivity
  • Storage: Write permissions, model cache paths
  • Endpoints: Health checks, API responses, inference capability
  • Kubernetes: Pod scheduling, resource allocation
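To make this concrete, here is a minimal sketch of what the GPU checks can look like. It mirrors the sample run above (nvidia-smi on PATH, CUDA visible to PyTorch, memory above a 20 GB threshold); the actual test names and thresholds in the repo may differ.

import shutil
import subprocess

import pytest
import torch

MIN_GPU_MEMORY_GB = 20  # matches the "GPU Memory > 20GB" check in the sample run


def test_nvidia_smi_available():
    """nvidia-smi must be on PATH and exit cleanly."""
    assert shutil.which("nvidia-smi"), "nvidia-smi not found on PATH"
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    assert result.returncode == 0


def test_cuda_available():
    """PyTorch must see at least one CUDA device."""
    assert torch.cuda.is_available(), "CUDA not available to PyTorch"


def test_gpu_memory_capacity():
    """Every visible GPU must have enough memory for the model."""
    if not torch.cuda.is_available():
        pytest.skip("no CUDA device")
    for i in range(torch.cuda.device_count()):
        total_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
        assert total_gb >= MIN_GPU_MEMORY_GB, (
            f"GPU {i} has {total_gb:.1f} GB, need {MIN_GPU_MEMORY_GB} GB"
        )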


Enhancement #2: ELK Stack for Centralized Logging

Metrics tell you WHAT happened. Logs tell you WHY.

Grafana and Prometheus are excellent for metrics. But when you're debugging why inference failed for a specific request, you need logs.

v2.0 adds a complete ELK stack:

Filebeat (Collection) --> Logstash (Processing) --> Elasticsearch (Storage) --> Kibana (Visualization)
        

The results speak for themselves:

Metric                   Value
Log Documents Indexed    248,805+
Unique Pods Monitored    50
Containers Tracked       37
Namespaces Covered       5
Query Response Time      < 100 ms

Real debugging scenarios solved:

  1. "Why did inference fail?"
  2. "What happened to the GPU?"
  3. "Show me all errors in the last hour"
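Scenario 3, for instance, reduces to one Elasticsearch query. A minimal sketch, assuming Filebeat's default filebeat-* indices; the in-cluster hostname below is hypothetical, and matching "error" in the message field is a simple heuristic:

import requests

# Hypothetical in-cluster service URL; port 9200 matches the
# Elasticsearch connectivity test above.
ES_URL = "http://elasticsearch.llm-observability.svc:9200"

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"message": "error"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 50,
}

resp = requests.post(f"{ES_URL}/filebeat-*/_search", json=query, timeout=10)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    # kubernetes.pod.name is the field Filebeat's Kubernetes metadata adds
    print(src.get("kubernetes", {}).get("pod", {}).get("name"), src.get("message"))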


The Architecture

+-------------------------------------------------------------------+
|                        KUBERNETES CLUSTER                         |
|                                                                   |
|  +-------------------------------------------------------------+  |
|  |                 llm-observability namespace                 |  |
|  |                                                             |  |
|  |  [Filebeat]      --> [Logstash]        (Log Pipeline)       |  |
|  |  [Elasticsearch] --> [Kibana]          (Logs & Dashboards)  |  |
|  |  [Prometheus]    --> [Grafana]         (Metrics & Alerts)   |  |
|  +-------------------------------------------------------------+  |
|                                                                   |
|  +-------------------------------------------------------------+  |
|  |  [vLLM + Mistral-7B]    [DCGM Exporter]    [GPU Operator]   |  |
|  |        GPU Node 1       GPU Node 2                          |  |
|  |    [A10 24GB x 2]     [A10 24GB x 2]                        |  |
|  +-------------------------------------------------------------+  |
+-------------------------------------------------------------------+
        

Why Both? The Observability Triangle

                    PREVENTION
                        /\
                       /  \
                      /    \
                     /Pytest\
                    /  Tests \
                   /          \
                  /____________\
                 /              \
                /                \
          METRICS              LOGS
         (Prometheus)        (ELK Stack)

         "What is           "Why did it
          happening?"        happen?"
        

  • Pytest = prevent issues before deployment
  • Prometheus/Grafana = monitor what's happening now
  • ELK Stack = understand why things happened

Together, they create a complete observability story.


Real-World Impact

Before v2.0:

  • Deployed vLLM on OKE
  • Looked good in kubectl
  • 2 hours later: OOM errors
  • Spent 4 hours debugging

After v2.0:

  • Ran pytest infrastructure tests
  • Test failed: "GPU Memory check failed - only 16GB available"
  • Fixed node pool configuration
  • Deployed with confidence
  • Zero incidents

Time saved: ~4 hours per deployment


It's a Framework, Not Just a Stack

Component            Customise For
Pytest Tests         Your specific validation needs
Kibana Dashboards    Your metrics and KPIs
Logstash Pipelines   Your log formats
Alert Rules          Your SLAs

Add tests for your use case:

import requests

def test_my_model_loaded():
    """Verify that a specific model is loaded and served."""
    response = requests.get('http://vllm:8000/v1/models', timeout=10)
    response.raise_for_status()
    models = response.json()['data']
    assert any(m['id'] == 'my-custom-model' for m in models)
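Assuming the suite follows a standard pytest layout (check the repo for the exact paths), you can run just your new check with pytest's keyword filter:

pytest -v -k my_model_loaded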

Technical Stack

This isn't a one-size-fits-all solution. It's a template you can customise:

  • Kubernetes: Oracle Kubernetes Engine (OKE)
  • Inference: vLLM serving Mistral-7B-Instruct-v0.3
  • GPU: NVIDIA GPU Operator + DCGM Exporter on A10 nodes
  • Metrics: Prometheus + Grafana
  • Logs: Filebeat + Logstash + Elasticsearch + Kibana
  • Testing: pytest

Key Takeaways

  1. Metrics alone aren't enough - You need logs to debug issues
  2. Dashboards are reactive - Tests are proactive
  3. Infrastructure testing catches issues before production - Not after
  4. Centralized logging is essential - When you're debugging across 50 pods
  5. Build a framework, not a one-off solution - Make it extensible


Get Started

The complete stack is open source and available on GitHub:

GitHub Repository: https://github.com/deepaksatna/LLM-Observability-Stack-v2.0

What's Included:

  • 16 Kubernetes deployment manifests
  • 4 Kibana dashboards (with screenshots)
  • 9 Grafana dashboards
  • 5 pytest test modules
  • Complete documentation
  • Deployment scripts

Test Environment:

  • Oracle Kubernetes Engine (OKE)
  • 4x NVIDIA A10 GPUs (96GB total VRAM)
  • Mistral-7B-Instruct-v0.3
  • 248,805+ log documents indexed


What's Next?

In future versions, I'm exploring:

  • Automated remediation - Self-healing based on test failures
  • Cost tracking - GPU utilization to cost mapping
  • Multi-cluster support - Federated observability
  • AI-powered log analysis - Using LLMs to analyze their own logs


Let's Connect

If you're running LLM workloads on Kubernetes and struggling with observability, I'd love to hear about your challenges.

What's the hardest part of monitoring your LLM infrastructure?

Drop a comment below or reach out directly.


