title: LLM Inference Dashboard
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: mit

LLM Inference Dashboard

A production-grade Gradio dashboard for monitoring vLLM inference on multi-GPU setups. It provides alerting, request tracing, A/B comparison, load testing, and historical analysis.

Features

| Feature | Description |
|---|---|
| Core Monitoring | GPU stats, inference metrics, quantization info |
| Alerting | Configurable thresholds, Slack/webhook notifications |
| Request Tracing | Per-request latency breakdown, slow request logging |
| A/B Comparison | Side-by-side deployment comparison |
| Load Testing | Built-in load generator with saturation detection |
| Historical Analysis | SQLite storage, trend queries |
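
As a rough illustration of the Historical Analysis feature, the sketch below stores metric samples in SQLite and runs a per-minute trend query. The `metrics` table, its columns, and the helper names `open_store`, `record`, and `minute_averages` are assumptions made for this example, not the dashboard's actual schema.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a SQLite store with a hypothetical `metrics` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        " ts REAL NOT NULL,"      # Unix timestamp in seconds
        " name TEXT NOT NULL,"    # metric key, e.g. "tokens_per_sec"
        " value REAL NOT NULL)"
    )
    return conn

def record(conn, ts, name, value):
    """Insert one sample and commit immediately."""
    conn.execute(
        "INSERT INTO metrics (ts, name, value) VALUES (?, ?, ?)",
        (ts, name, value),
    )
    conn.commit()

def minute_averages(conn, name):
    """Return [(minute_start_ts, avg_value), ...] ordered by time."""
    return conn.execute(
        "SELECT CAST(ts / 60 AS INTEGER) * 60 AS bucket, AVG(value)"
        " FROM metrics WHERE name = ? GROUP BY bucket ORDER BY bucket",
        (name,),
    ).fetchall()
```

Passing a file path such as the `DB_PATH` default instead of `":memory:"` would persist samples across restarts; the parent directory must already exist.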

Tabs

  1. GPU / Rank Status - Real-time GPU memory, utilization, temperature, and tensor parallel rank mapping
  2. Inference - Tokens/sec, TTFT, batch size, KV cache utilization, latency metrics
  3. Quantization - Detect and display GPTQ, AWQ, bitsandbytes quantization settings
  4. Loading - Model loading progress with shard tracking
  5. Alerts - Configure alert thresholds and webhook notifications
  6. Tracing - Request-level latency breakdown and slow request analysis
  7. A/B Compare - Compare metrics between two vLLM deployments
  8. Load Test - Run load tests with configurable concurrency and RPS
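
The threshold checks behind the Alerts tab could be evaluated along these lines. This is a minimal sketch: the `Threshold` class, the `evaluate` function, and the metric keys are hypothetical, and sending the resulting messages to a webhook is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    metric: str          # hypothetical metric key, e.g. "gpu_temp_c"
    limit: float
    above: bool = True   # True: alert when value > limit; False: when value < limit

def evaluate(thresholds, sample):
    """Return a list of human-readable alert messages for one metrics sample."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric missing from this sample; skip rather than alert
        breached = value > t.limit if t.above else value < t.limit
        if breached:
            op = ">" if t.above else "<"
            alerts.append(f"{t.metric}={value} {op} {t.limit}")
    return alerts
```

A threshold with `above=False` covers metrics where a drop is the problem, such as throughput falling below a floor.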

Usage

Local Development

pip install -r requirements.txt
python app.py

With vLLM Server

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model <model_name> \
  --tensor-parallel-size <N> \
  --port 8000

# Set environment variables (optional)
export VLLM_HOST=localhost
export VLLM_PORT=8000

# Launch dashboard
python app.py

Environment Variables

| Variable | Default | Description |
|---|---|---|
| VLLM_HOST | localhost | vLLM server hostname |
| VLLM_PORT | 8000 | vLLM server port |
| MODEL_PATH | None | Path to model for quantization detection |
| DB_PATH | data/metrics.db | SQLite database path |
| SLACK_WEBHOOK | None | Slack webhook URL for alerts |
| PAGERDUTY_KEY | None | PagerDuty routing key |
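
One way these variables and their defaults might be read in Python is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical names for illustration; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Settings:
    vllm_host: str
    vllm_port: int
    model_path: Optional[str]
    db_path: str
    slack_webhook: Optional[str]
    pagerduty_key: Optional[str]

def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        vllm_host=env.get("VLLM_HOST", "localhost"),
        vllm_port=int(env.get("VLLM_PORT", "8000")),  # raises ValueError if not numeric
        model_path=env.get("MODEL_PATH"),             # None when unset
        db_path=env.get("DB_PATH", "data/metrics.db"),
        slack_webhook=env.get("SLACK_WEBHOOK"),
        pagerduty_key=env.get("PAGERDUTY_KEY"),
    )
```

Accepting a mapping makes the loader easy to exercise without touching the real environment, for example `load_settings({"VLLM_PORT": "9000"})`.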

Demo Mode

When no vLLM server is connected, the dashboard runs in demo mode with simulated GPU metrics.
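
The fallback could work roughly as follows. This sketch assumes that "connected" means a TCP connection to the server succeeds; the function names, value ranges, and returned fields are illustrative and not taken from the dashboard's code.

```python
import random
import socket

def server_reachable(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def simulated_gpu_metrics(num_gpus=2, seed=None):
    """Generate plausible-looking but fake per-GPU readings."""
    rng = random.Random(seed)
    return [
        {
            "gpu": i,
            "memory_used_pct": round(rng.uniform(40.0, 90.0), 1),
            "utilization_pct": round(rng.uniform(20.0, 100.0), 1),
            "temperature_c": round(rng.uniform(45.0, 80.0), 1),
        }
        for i in range(num_gpus)
    ]

def collect(host, port, real_collector, num_gpus=2, reachable=server_reachable):
    """Use real_collector() when the server answers, else fall back to demo data."""
    if reachable(host, port):
        return {"demo": False, "gpus": real_collector()}
    return {"demo": True, "gpus": simulated_gpu_metrics(num_gpus)}
```

The `demo` flag in the result lets the frontend label simulated readings clearly so they are not mistaken for real hardware data.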

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Gradio Frontend                     │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐│
│  │GPU Stats│ │Loading  │ │Quant    │ │Inference Metrics││
│  │   Tab   │ │Progress │ │Details  │ │      Tab        ││
│  └─────────┘ └─────────┘ └─────────┘ └─────────────────┘│
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                    Metrics Collector                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐  │
│  │ pynvml   │ │Prometheus│ │ vLLM API │ │Model Config│  │
│  │ (GPUs)   │ │(/metrics)│ │ (status) │ │  (quant)   │  │
│  └──────────┘ └──────────┘ └──────────┘ └────────────┘  │
└─────────────────────────────────────────────────────────┘
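
To illustrate the Prometheus source in the collector layer, here is a minimal sketch of parsing the text returned by a `/metrics` endpoint into a dictionary. The function name `parse_prometheus_text` is hypothetical, the parser ignores labels and does not handle `}` inside quoted label values, and the metric names in the sample are examples that may differ between vLLM versions.

```python
import re

# Matches lines like:  name{label="x"} 1.5   or   name 1.5
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

def parse_prometheus_text(text):
    """Parse Prometheus exposition text into {metric_name: [float, ...]}."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = _SAMPLE_LINE.match(line)
        if not match:
            continue  # not a sample line this simple parser understands
        name, _labels, raw = match.groups()
        try:
            value = float(raw)
        except ValueError:
            continue  # unparseable value; ignore the line
        # One entry per label set, in the order the lines appear.
        out.setdefault(name, []).append(value)
    return out

# Illustrative input only; real output comes from the server's /metrics endpoint.
EXAMPLE_TEXT = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""

parsed = parse_prometheus_text(EXAMPLE_TEXT)
```

For anything beyond a quick sketch, a maintained parser such as the one in the `prometheus_client` package is likely a safer choice than a hand-written regular expression.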

License

MIT