---
title: LLM Inference Dashboard
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: mit
---
# LLM Inference Dashboard

A production-grade Gradio dashboard for monitoring vLLM inference on multi-GPU setups, with alerting, request tracing, A/B comparison, load testing, and historical analysis.
## Features
| Feature | Description |
|---|---|
| Core Monitoring | GPU stats, inference metrics, quantization info |
| Alerting | Configurable thresholds, Slack/webhook notifications |
| Request Tracing | Per-request latency breakdown, slow request logging |
| A/B Comparison | Side-by-side deployment comparison |
| Load Testing | Built-in load generator with saturation detection |
| Historical Analysis | SQLite storage, trend queries |
## Tabs
- GPU / Rank Status - Real-time GPU memory, utilization, temperature, and tensor parallel rank mapping
- Inference - Tokens/sec, TTFT, batch size, KV cache utilization, latency metrics
- Quantization - Detect and display GPTQ, AWQ, bitsandbytes quantization settings
- Loading - Model loading progress with shard tracking
- Alerts - Configure alert thresholds and webhook notifications
- Tracing - Request-level latency breakdown and slow request analysis
- A/B Compare - Compare metrics between two vLLM deployments
- Load Test - Run load tests with configurable concurrency and RPS
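
The threshold checks behind the Alerts tab can be sketched as a simple rule evaluation. `AlertRule` and `evaluate` are illustrative names, not the dashboard's actual API:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float
    comparison: str = "gt"  # fire when value > threshold ("lt" fires when below)

def evaluate(rules, metrics):
    """Return the names of metrics whose current value breaches a rule."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not collected this cycle; skip rather than fire
        breached = value > rule.threshold if rule.comparison == "gt" else value < rule.threshold
        if breached:
            fired.append(rule.metric)
    return fired

rules = [AlertRule("gpu_temp_c", 85), AlertRule("tokens_per_sec", 50, "lt")]
print(evaluate(rules, {"gpu_temp_c": 91.0, "tokens_per_sec": 120.0}))  # ['gpu_temp_c']
```

In the real dashboard, a fired rule would then be dispatched to the configured Slack or PagerDuty webhook.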
## Usage

### Local Development

```bash
pip install -r requirements.txt
python app.py
```
### With vLLM Server

```bash
# Start the vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model <model_name> \
    --tensor-parallel-size <N> \
    --port 8000

# Set environment variables (optional)
export VLLM_HOST=localhost
export VLLM_PORT=8000

# Launch the dashboard
python app.py
```
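
The dashboard scrapes the server's Prometheus endpoint (`http://<host>:<port>/metrics`). A minimal parser for the Prometheus text format looks like this; the sample metric names follow vLLM's conventions but may vary by version:

```python
import re

def parse_prometheus(text):
    """Parse 'name{labels} value' lines from a Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments and blanks
            continue
        m = re.match(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+([0-9.eE+-]+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(3))
    return metrics

# In production this text would come from an HTTP GET on /metrics.
sample = """# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""
print(parse_prometheus(sample))
```

Labels are discarded here for simplicity; a multi-model deployment would need to keep them.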
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VLLM_HOST` | `localhost` | vLLM server hostname |
| `VLLM_PORT` | `8000` | vLLM server port |
| `MODEL_PATH` | None | Path to model for quantization detection |
| `DB_PATH` | `data/metrics.db` | SQLite database path |
| `SLACK_WEBHOOK` | None | Slack webhook URL for alerts |
| `PAGERDUTY_KEY` | None | PagerDuty routing key |
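
Reading these variables with the defaults above is a one-liner each; `load_config` is an illustrative helper, not necessarily how `app.py` is structured:

```python
import os

def load_config():
    """Read dashboard settings from the environment, falling back to defaults."""
    return {
        "vllm_host": os.environ.get("VLLM_HOST", "localhost"),
        "vllm_port": int(os.environ.get("VLLM_PORT", "8000")),
        "model_path": os.environ.get("MODEL_PATH"),      # None disables quant detection
        "db_path": os.environ.get("DB_PATH", "data/metrics.db"),
        "slack_webhook": os.environ.get("SLACK_WEBHOOK"),
        "pagerduty_key": os.environ.get("PAGERDUTY_KEY"),
    }

cfg = load_config()
print(cfg["vllm_host"], cfg["vllm_port"])
```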
## Demo Mode

When no vLLM server is connected, the dashboard runs in demo mode with simulated GPU metrics.
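
The simulated metrics can be as simple as seeded random values in plausible ranges; this sketch assumes hypothetical field names, not the dashboard's actual schema:

```python
import random

def simulated_gpu_stats(num_gpus=4, seed=None):
    """Generate plausible per-GPU metrics for demo mode (no NVML required)."""
    rng = random.Random(seed)  # seeding makes demo runs reproducible
    return [
        {
            "gpu": i,
            "mem_used_gb": round(rng.uniform(30, 75), 1),
            "utilization_pct": rng.randint(60, 99),
            "temperature_c": rng.randint(55, 80),
        }
        for i in range(num_gpus)
    ]

for row in simulated_gpu_stats(seed=42):
    print(row)
```

With a live server, the same dict shape would instead be filled from pynvml queries.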
## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     Gradio Frontend                     │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │GPU Stats│ │Loading  │ │Quant    │ │Inference Metrics│ │
│ │  Tab    │ │Progress │ │Details  │ │       Tab       │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    Metrics Collector                    │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌────────────┐  │
│ │  pynvml  │ │Prometheus │ │ vLLM API │ │Model Config│  │
│ │  (GPUs)  │ │(/metrics) │ │ (status) │ │  (quant)   │  │
│ └──────────┘ └───────────┘ └──────────┘ └────────────┘  │
└─────────────────────────────────────────────────────────┘
```
## License

MIT