---
title: LLM Inference Dashboard
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: mit
---
# LLM Inference Dashboard

A production-grade Gradio dashboard for monitoring vLLM inference on multi-GPU setups, with alerting, request tracing, A/B comparison, load testing, and historical analysis.
## Features
| Feature | Description |
|---|---|
| Core Monitoring | GPU stats, inference metrics, quantization info |
| Alerting | Configurable thresholds, Slack/webhook notifications |
| Request Tracing | Per-request latency breakdown, slow request logging |
| A/B Comparison | Side-by-side deployment comparison |
| Load Testing | Built-in load generator with saturation detection |
| Historical Analysis | SQLite storage, trend queries |
## Tabs
- GPU / Rank Status - Real-time GPU memory, utilization, temperature, and tensor parallel rank mapping
- Inference - Tokens/sec, TTFT, batch size, KV cache utilization, latency metrics
- Quantization - Detect and display GPTQ, AWQ, bitsandbytes quantization settings
- Loading - Model loading progress with shard tracking
- Alerts - Configure alert thresholds and webhook notifications
- Tracing - Request-level latency breakdown and slow request analysis
- A/B Compare - Compare metrics between two vLLM deployments
- Load Test - Run load tests with configurable concurrency and RPS
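
The threshold checks behind the Alerts tab can be sketched as a simple rule evaluation. `AlertRule` and `evaluate` are illustrative names, not the dashboard's actual API:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float
    comparison: str = "gt"  # fire when value > threshold ("lt" fires when below)

def evaluate(rules, metrics):
    """Return the names of metrics whose current value breaches a rule."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not collected this cycle; skip rather than fire
        breached = value > rule.threshold if rule.comparison == "gt" else value < rule.threshold
        if breached:
            fired.append(rule.metric)
    return fired

rules = [AlertRule("gpu_temp_c", 85), AlertRule("tokens_per_sec", 50, "lt")]
print(evaluate(rules, {"gpu_temp_c": 91.0, "tokens_per_sec": 120.0}))  # ['gpu_temp_c']
```

In the real dashboard, a fired rule would then be dispatched to the configured Slack or PagerDuty webhook.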
## Usage

### Local Development

```bash
pip install -r requirements.txt
python app.py
```
### With vLLM Server

```bash
# Start the vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model <model_name> \
    --tensor-parallel-size <N> \
    --port 8000

# Set environment variables (optional)
export VLLM_HOST=localhost
export VLLM_PORT=8000

# Launch the dashboard
python app.py
```
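
The dashboard scrapes the server's Prometheus endpoint (`http://<host>:<port>/metrics`). A minimal parser for the Prometheus text format looks like this; the sample metric names follow vLLM's conventions but may vary by version:

```python
import re

def parse_prometheus(text):
    """Parse 'name{labels} value' lines from a Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments and blanks
            continue
        m = re.match(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+([0-9.eE+-]+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(3))
    return metrics

# In production this text would come from an HTTP GET on /metrics.
sample = """# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""
print(parse_prometheus(sample))
```

Labels are discarded here for simplicity; a multi-model deployment would need to keep them.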
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VLLM_HOST` | `localhost` | vLLM server hostname |
| `VLLM_PORT` | `8000` | vLLM server port |
| `MODEL_PATH` | None | Path to model for quantization detection |
| `DB_PATH` | `data/metrics.db` | SQLite database path |
| `SLACK_WEBHOOK` | None | Slack webhook URL for alerts |
| `PAGERDUTY_KEY` | None | PagerDuty routing key |
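
Reading these variables with the defaults above is a one-liner each; `load_config` is an illustrative helper, not necessarily how `app.py` is structured:

```python
import os

def load_config():
    """Read dashboard settings from the environment, falling back to defaults."""
    return {
        "vllm_host": os.environ.get("VLLM_HOST", "localhost"),
        "vllm_port": int(os.environ.get("VLLM_PORT", "8000")),
        "model_path": os.environ.get("MODEL_PATH"),      # None disables quant detection
        "db_path": os.environ.get("DB_PATH", "data/metrics.db"),
        "slack_webhook": os.environ.get("SLACK_WEBHOOK"),
        "pagerduty_key": os.environ.get("PAGERDUTY_KEY"),
    }

cfg = load_config()
print(cfg["vllm_host"], cfg["vllm_port"])
```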
## Demo Mode

When no vLLM server is connected, the dashboard runs in demo mode with simulated GPU metrics.
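
The simulated metrics can be as simple as seeded random values in plausible ranges; this sketch assumes hypothetical field names, not the dashboard's actual schema:

```python
import random

def simulated_gpu_stats(num_gpus=4, seed=None):
    """Generate plausible per-GPU metrics for demo mode (no NVML required)."""
    rng = random.Random(seed)  # seeding makes demo runs reproducible
    return [
        {
            "gpu": i,
            "mem_used_gb": round(rng.uniform(30, 75), 1),
            "utilization_pct": rng.randint(60, 99),
            "temperature_c": rng.randint(55, 80),
        }
        for i in range(num_gpus)
    ]

for row in simulated_gpu_stats(seed=42):
    print(row)
```

With a live server, the same dict shape would instead be filled from pynvml queries.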
## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     Gradio Frontend                     │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │GPU Stats│ │Loading  │ │Quant    │ │Inference Metrics│ │
│ │  Tab    │ │Progress │ │Details  │ │       Tab       │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    Metrics Collector                    │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌────────────┐  │
│ │  pynvml  │ │Prometheus │ │ vLLM API │ │Model Config│  │
│ │  (GPUs)  │ │(/metrics) │ │ (status) │ │  (quant)   │  │
│ └──────────┘ └───────────┘ └──────────┘ └────────────┘  │
└─────────────────────────────────────────────────────────┘
```
## License

MIT