jkottu Claude Opus 4.5 committed on
Commit aefabf0 · 0 parents

Initial commit: LLM Inference Dashboard


A production-grade Gradio dashboard for monitoring vLLM inference
on multi-GPU setups with:
- GPU/Rank monitoring (memory, utilization, temperature)
- Inference metrics (tokens/sec, TTFT, KV cache)
- Quantization detection (GPTQ, AWQ, bitsandbytes)
- Model loading progress tracking
- Alerting with Slack/webhook integration
- Request tracing with latency breakdown
- A/B deployment comparison
- Built-in load testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

.gitignore ADDED
@@ -0,0 +1,53 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ venv/
+ ENV/
+ env/
+ .venv/
+
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+ *~
+
+ # Data files
+ data/*.db
+ *.sqlite
+ *.sqlite3
+
+ # Logs
+ *.log
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Environment
+ .env
+ .env.local
+
+ # Claude
+ .claude/
README.md ADDED
@@ -0,0 +1,103 @@
+ ---
+ title: LLM Inference Dashboard
+ emoji: 📊
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.9.1
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ # LLM Inference Dashboard
+
+ A production-grade Gradio dashboard for monitoring vLLM inference on multi-GPU setups with alerting, request tracing, A/B comparison, load testing, and historical analysis.
+
+ ## Features
+
+ | Feature | Description |
+ |---------|-------------|
+ | Core Monitoring | GPU stats, inference metrics, quantization info |
+ | Alerting | Configurable thresholds, Slack/webhook notifications |
+ | Request Tracing | Per-request latency breakdown, slow request logging |
+ | A/B Comparison | Side-by-side deployment comparison |
+ | Load Testing | Built-in load generator with saturation detection |
+ | Historical Analysis | SQLite storage, trend queries |
+
+ ## Tabs
+
+ 1. **GPU / Rank Status** - Real-time GPU memory, utilization, temperature, and tensor parallel rank mapping
+ 2. **Inference** - Tokens/sec, TTFT, batch size, KV cache utilization, latency metrics
+ 3. **Quantization** - Detect and display GPTQ, AWQ, bitsandbytes quantization settings
+ 4. **Loading** - Model loading progress with shard tracking
+ 5. **Alerts** - Configure alert thresholds and webhook notifications
+ 6. **Tracing** - Request-level latency breakdown and slow request analysis
+ 7. **A/B Compare** - Compare metrics between two vLLM deployments
+ 8. **Load Test** - Run load tests with configurable concurrency and RPS
+
+ ## Usage
+
+ ### Local Development
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ ### With vLLM Server
+
+ ```bash
+ # Start vLLM server
+ python -m vllm.entrypoints.openai.api_server \
+     --model <model_name> \
+     --tensor-parallel-size <N> \
+     --port 8000
+
+ # Set environment variables (optional)
+ export VLLM_HOST=localhost
+ export VLLM_PORT=8000
+
+ # Launch dashboard
+ python app.py
+ ```
+
+ ## Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `VLLM_HOST` | localhost | vLLM server hostname |
+ | `VLLM_PORT` | 8000 | vLLM server port |
+ | `MODEL_PATH` | None | Path to model for quantization detection |
+ | `DB_PATH` | data/metrics.db | SQLite database path |
+ | `SLACK_WEBHOOK` | None | Slack webhook URL for alerts |
+ | `PAGERDUTY_KEY` | None | PagerDuty routing key |
+
+ ## Demo Mode
+
+ When no vLLM server is connected, the dashboard runs in demo mode with simulated GPU metrics.
+
+ ## Architecture
+
+ ```
+ ┌────────────────────────────────────────────────────────┐
+ │                    Gradio Frontend                     │
+ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐│
+ │ │GPU Stats│ │ Loading │ │  Quant  │ │Inference Metrics││
+ │ │   Tab   │ │Progress │ │ Details │ │       Tab       ││
+ │ └─────────┘ └─────────┘ └─────────┘ └─────────────────┘│
+ └────────────────────────────────────────────────────────┘
+
+
+ ┌────────────────────────────────────────────────────────┐
+ │                   Metrics Collector                    │
+ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐  │
+ │ │  pynvml  │ │Prometheus│ │ vLLM API │ │Model Config│  │
+ │ │  (GPUs)  │ │(/metrics)│ │ (status) │ │  (quant)   │  │
+ │ └──────────┘ └──────────┘ └──────────┘ └────────────┘  │
+ └────────────────────────────────────────────────────────┘
+ ```
+
+ ## License
+
+ MIT
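The environment variables above are consumed by the app's config module (`from config import config` in app.py). A minimal sketch of how such a config object can read them with the documented defaults — the class and field names here are assumptions, since `config.py` itself is not part of this excerpt:

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DashboardConfig:
    # Defaults mirror the Environment Variables table above.
    vllm_host: str = field(default_factory=lambda: os.getenv("VLLM_HOST", "localhost"))
    vllm_port: int = field(default_factory=lambda: int(os.getenv("VLLM_PORT", "8000")))
    model_path: Optional[str] = field(default_factory=lambda: os.getenv("MODEL_PATH"))
    db_path: str = field(default_factory=lambda: os.getenv("DB_PATH", "data/metrics.db"))
    slack_webhook: Optional[str] = field(default_factory=lambda: os.getenv("SLACK_WEBHOOK"))

    @property
    def metrics_endpoint(self) -> str:
        # vLLM exposes Prometheus metrics at /metrics on the API port
        return f"http://{self.vllm_host}:{self.vllm_port}/metrics"


config = DashboardConfig()
```

Using `default_factory` means the environment is read at instantiation time, so tests can override variables before constructing the config.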
app.py ADDED
@@ -0,0 +1,313 @@
+ """
+ LLM Inference Dashboard - Main Application
+
+ A production-grade Gradio dashboard for monitoring vLLM inference
+ on multi-GPU setups with alerting, request tracing, A/B comparison,
+ load testing, and historical analysis.
+ """
+
+ import asyncio
+ import logging
+ import os
+
+ import gradio as gr
+
+ from config import config
+ from collectors import GPUCollector, VLLMCollector, QuantizationCollector, LoadingTracker
+ from components import (
+     create_gpu_panel,
+     update_gpu_panel,
+     create_inference_panel,
+     update_inference_panel,
+     create_quant_panel,
+     update_quant_panel,
+     create_loading_panel,
+     update_loading_panel,
+     create_alerts_panel,
+     update_alerts_panel,
+     create_tracing_panel,
+     update_tracing_panel,
+     create_comparison_panel,
+     create_loadtest_panel,
+ )
+ from components.alerts_panel import get_alert_badge_html
+ from components.inference_panel import get_metrics_dict
+ from services import AlertEngine, AlertDispatcher, RequestTracer
+ from storage import MetricsDB
+ from utils import MetricHistory
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+ )
+ logger = logging.getLogger(__name__)
+
+
+ # Initialize global instances
+ db = MetricsDB(config.db_path)
+ history = MetricHistory(max_length=config.history_length)
+
+ # Collectors
+ gpu_collector = GPUCollector()
+ vllm_collector = VLLMCollector(config.metrics_endpoint)
+ quant_collector = QuantizationCollector(config.model_path)
+ loading_tracker = LoadingTracker(config.model_path)
+
+ # Services
+ alert_engine = AlertEngine(db)
+ alert_dispatcher = AlertDispatcher(
+     slack_webhook=config.slack_webhook,
+     pagerduty_key=config.pagerduty_routing_key,
+     generic_webhooks=config.generic_webhooks,
+ )
+ request_tracer = RequestTracer(db)
+
+
+ def check_connection():
+     """Check connection to vLLM server."""
+     connected = vllm_collector.check_connection()
+     if connected:
+         return (
+             '<div style="display: flex; align-items: center;">'
+             '<span style="width: 12px; height: 12px; background: #4caf50; '
+             'border-radius: 50%; display: inline-block; margin-right: 8px;"></span>'
+             '<span style="color: #2e7d32;">Connected</span></div>'
+         )
+     return (
+         '<div style="display: flex; align-items: center;">'
+         '<span style="width: 12px; height: 12px; background: #f44336; '
+         'border-radius: 50%; display: inline-block; margin-right: 8px;"></span>'
+         '<span style="color: #c62828;">Disconnected</span></div>'
+     )
+
+
+ def get_model_name():
+     """Get current model name."""
+     metrics = vllm_collector.collect()
+     return metrics.model_name or "Demo Mode"
+
+
+ def update_all_metrics():
+     """Update all metrics from collectors."""
+     # GPU metrics
+     gpu_table, gpu_memory_plot, gpu_util_plot, nccl_status = update_gpu_panel(
+         gpu_collector, history
+     )
+
+     # Inference metrics
+     (
+         throughput, ttft, batch_size, kv_cache, throughput_plot,
+         prefill_pct, decode_pct, queue_depth, e2e_latency, latency_plot
+     ) = update_inference_panel(vllm_collector, history)
+
+     # Check alerts
+     metrics = vllm_collector.collect()
+     metrics_dict = get_metrics_dict(metrics)
+
+     # Add GPU metrics for alerting
+     gpu_stats = gpu_collector.collect()
+     if gpu_stats:
+         max_gpu_memory = max(s.memory_percent for s in gpu_stats)
+         metrics_dict["gpu_memory_percent"] = max_gpu_memory
+
+     new_alerts = alert_engine.evaluate(metrics_dict)
+
+     # Dispatch new alerts: schedule on the running loop if there is one,
+     # otherwise run synchronously
+     for alert in new_alerts:
+         try:
+             loop = asyncio.get_running_loop()
+             loop.create_task(alert_dispatcher.dispatch(alert))
+         except RuntimeError:
+             # No running event loop
+             asyncio.run(alert_dispatcher.dispatch(alert))
+
+     # Get alert badge
+     active_alerts = alert_engine.get_active_alerts()
+     alert_badge = get_alert_badge_html(active_alerts)
+
+     # Connection status
+     connection_status = check_connection()
+     model_name = get_model_name()
+
+     return (
+         # Header
+         connection_status,
+         model_name,
+         alert_badge,
+         # GPU tab
+         gpu_table,
+         gpu_memory_plot,
+         gpu_util_plot,
+         nccl_status,
+         # Inference tab
+         throughput,
+         ttft,
+         batch_size,
+         kv_cache,
+         throughput_plot,
+         prefill_pct,
+         decode_pct,
+         queue_depth,
+         e2e_latency,
+         latency_plot,
+     )
+
+
+ def create_dashboard():
+     """Create the main dashboard application."""
+
+     custom_css = """
+     .gradio-container { max-width: 1400px !important; }
+     .panel-header { font-size: 1.2em; font-weight: bold; margin-bottom: 10px; }
+     """
+
+     with gr.Blocks(title="LLM Inference Dashboard", css=custom_css) as app:
+         gr.Markdown("# LLM Inference Dashboard")
+         gr.Markdown("*Real-time monitoring for vLLM inference servers*")
+
+         # Header row: connection status, model info, active alerts
+         with gr.Row():
+             status_indicator = gr.HTML(
+                 value=check_connection(),
+                 label="Connection",
+             )
+             model_name_display = gr.Textbox(
+                 label="Model",
+                 value=get_model_name(),
+                 interactive=False,
+                 scale=2,
+             )
+             active_alerts_display = gr.HTML(
+                 value=get_alert_badge_html([]),
+                 label="Alerts",
+             )
+
+         # Main tabs
+         with gr.Tabs():
+             # Tab 1: GPU Status
+             with gr.Tab("GPU / Rank Status"):
+                 gpu_components = create_gpu_panel(history)
+
+             # Tab 2: Inference Metrics
+             with gr.Tab("Inference"):
+                 inference_components = create_inference_panel(history)
+
+             # Tab 3: Quantization
+             with gr.Tab("Quantization"):
+                 quant_components = create_quant_panel()
+
+                 # Initial update
+                 (
+                     quant_type, bits, group_size, quant_details, layer_table
+                 ) = update_quant_panel(quant_collector)
+
+                 quant_components["quant_type"].value = quant_type
+                 quant_components["bits"].value = bits
+                 quant_components["group_size"].value = group_size
+
+             # Tab 4: Loading Progress
+             with gr.Tab("Loading"):
+                 loading_components = create_loading_panel()
+
+             # Tab 5: Alerts
+             with gr.Tab("Alerts"):
+                 alerts_components = create_alerts_panel(alert_engine, alert_dispatcher)
+
+             # Tab 6: Request Tracing
+             with gr.Tab("Tracing"):
+                 tracing_components = create_tracing_panel(request_tracer)
+
+             # Tab 7: A/B Comparison
+             with gr.Tab("A/B Compare"):
+                 comparison_components = create_comparison_panel()
+
+             # Tab 8: Load Testing
+             with gr.Tab("Load Test"):
+                 loadtest_components = create_loadtest_panel()
+
+         # Auto-refresh timer
+         timer = gr.Timer(config.refresh_interval)
+
+         # Collect all outputs for timer update
+         timer_outputs = [
+             # Header
+             status_indicator,
+             model_name_display,
+             active_alerts_display,
+             # GPU tab
+             gpu_components["gpu_table"],
+             gpu_components["gpu_memory_plot"],
+             gpu_components["gpu_util_plot"],
+             gpu_components["nccl_status"],
+             # Inference tab
+             inference_components["throughput"],
+             inference_components["ttft"],
+             inference_components["batch_size"],
+             inference_components["kv_cache"],
+             inference_components["throughput_plot"],
+             inference_components["prefill_pct"],
+             inference_components["decode_pct"],
+             inference_components["queue_depth"],
+             inference_components["e2e_latency"],
+             inference_components["latency_plot"],
+         ]
+
+         timer.tick(fn=update_all_metrics, outputs=timer_outputs)
+
+         # Manual refresh for tabs that don't auto-update
+         def refresh_quant():
+             return update_quant_panel(quant_collector)
+
+         def refresh_loading():
+             return update_loading_panel(loading_tracker)
+
+         def refresh_alerts():
+             return update_alerts_panel(alert_engine, db)
+
+     return app
+
+
+ def main():
+     """Main entry point."""
+     logger.info("Starting LLM Inference Dashboard")
+     logger.info(f"vLLM endpoint: {config.metrics_endpoint}")
+     logger.info(f"Database: {config.db_path}")
+
+     # Check initial connection
+     if vllm_collector.check_connection():
+         logger.info("Successfully connected to vLLM server")
+
+         # Set model ready if connected
+         loading_tracker.set_ready()
+
+         # Try to detect quantization
+         metrics = vllm_collector.collect()
+         if metrics.model_name:
+             quant_collector.set_model_path(metrics.model_name)
+     else:
+         logger.warning("Could not connect to vLLM server - dashboard will show mock data")
+
+     # Create and launch the dashboard
+     app = create_dashboard()
+
+     # Check if running on HuggingFace Spaces
+     if os.getenv("SPACE_ID"):
+         app.launch()
+     else:
+         app.launch(
+             server_name="0.0.0.0",
+             server_port=7860,
+             share=False,
+             show_error=True,
+         )
+
+
+ # For HuggingFace Spaces - create demo instance
+ demo = create_dashboard()
+
+ if __name__ == "__main__":
+     main()
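The alert-dispatch code in `update_all_metrics` has to fire an async coroutine from a synchronous Gradio callback, whether or not an event loop is already running. A standalone sketch of that pattern, with a dummy coroutine standing in for `AlertDispatcher.dispatch` (the real one posts to Slack/webhooks):

```python
import asyncio

sent = []


async def dispatch(alert: str) -> None:
    # Stand-in for AlertDispatcher.dispatch: just record the alert
    sent.append(alert)


def dispatch_alert(alert: str) -> None:
    """Fire an async dispatch from sync code, with or without a running loop."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No event loop running (plain sync caller): run to completion
        asyncio.run(dispatch(alert))
    else:
        # Inside a running loop (e.g. a server callback): schedule, don't block
        loop.create_task(dispatch(alert))


dispatch_alert("gpu_memory_high")
```

`asyncio.get_running_loop()` raises `RuntimeError` when no loop is active, which makes the two cases explicit; the deprecated `asyncio.get_event_loop()` hides that distinction.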
collectors/__init__.py ADDED
@@ -0,0 +1,13 @@
+ """Data collectors for monitoring vLLM inference."""
+
+ from .gpu_collector import GPUCollector
+ from .vllm_collector import VLLMCollector
+ from .quant_collector import QuantizationCollector
+ from .loading_tracker import LoadingTracker
+
+ __all__ = [
+     "GPUCollector",
+     "VLLMCollector",
+     "QuantizationCollector",
+     "LoadingTracker",
+ ]
collectors/gpu_collector.py ADDED
@@ -0,0 +1,174 @@
+ """GPU statistics collector using pynvml."""
+
+ from dataclasses import dataclass
+ from typing import List, Optional
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ # Try to import pynvml, provide mock if unavailable
+ try:
+     import pynvml
+     PYNVML_AVAILABLE = True
+ except ImportError:
+     PYNVML_AVAILABLE = False
+     logger.warning("pynvml not available - GPU stats will be simulated")
+
+
+ @dataclass
+ class GPUStats:
+     """Statistics for a single GPU."""
+     gpu_id: int
+     name: str
+     memory_used_gb: float
+     memory_total_gb: float
+     memory_percent: float
+     gpu_util_percent: float
+     temperature_c: int
+     power_watts: float
+     power_limit_watts: float
+     tp_rank: Optional[int] = None
+
+
+ class GPUCollector:
+     """Collects GPU statistics via NVIDIA Management Library."""
+
+     def __init__(self):
+         """Initialize the GPU collector."""
+         self._initialized = False
+         self._gpu_count = 0
+         self._rank_mapping: dict = {}
+
+         if PYNVML_AVAILABLE:
+             try:
+                 pynvml.nvmlInit()
+                 self._initialized = True
+                 self._gpu_count = pynvml.nvmlDeviceGetCount()
+                 logger.info(f"Initialized pynvml with {self._gpu_count} GPUs")
+             except Exception as e:
+                 logger.error(f"Failed to initialize pynvml: {e}")
+
+     def set_rank_mapping(self, mapping: dict) -> None:
+         """
+         Set tensor parallel rank to GPU ID mapping.
+
+         Args:
+             mapping: Dictionary mapping TP rank to GPU ID
+         """
+         self._rank_mapping = mapping
+
+     def get_gpu_count(self) -> int:
+         """Get the number of available GPUs."""
+         return self._gpu_count
+
+     def collect(self) -> List[GPUStats]:
+         """
+         Collect stats for all GPUs.
+
+         Returns:
+             List of GPUStats for each GPU
+         """
+         if not self._initialized:
+             return self._get_mock_stats()
+
+         stats = []
+         for i in range(self._gpu_count):
+             try:
+                 stat = self._collect_single_gpu(i)
+                 stats.append(stat)
+             except Exception as e:
+                 logger.error(f"Error collecting stats for GPU {i}: {e}")
+
+         return stats
+
+     def _collect_single_gpu(self, gpu_id: int) -> GPUStats:
+         """Collect stats for a single GPU."""
+         handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
+
+         # Get device name
+         name = pynvml.nvmlDeviceGetName(handle)
+         if isinstance(name, bytes):
+             name = name.decode("utf-8")
+
+         # Memory info
+         mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
+         memory_used_gb = mem_info.used / 1e9
+         memory_total_gb = mem_info.total / 1e9
+         memory_percent = (mem_info.used / mem_info.total) * 100
+
+         # Utilization
+         util = pynvml.nvmlDeviceGetUtilizationRates(handle)
+         gpu_util_percent = util.gpu
+
+         # Temperature
+         temperature_c = pynvml.nvmlDeviceGetTemperature(
+             handle, pynvml.NVML_TEMPERATURE_GPU
+         )
+
+         # Power
+         try:
+             power_watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
+             power_limit_watts = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
+         except pynvml.NVMLError:
+             power_watts = 0
+             power_limit_watts = 0
+
+         # Find TP rank for this GPU
+         tp_rank = None
+         for rank, gid in self._rank_mapping.items():
+             if gid == gpu_id:
+                 tp_rank = rank
+                 break
+
+         return GPUStats(
+             gpu_id=gpu_id,
+             name=name,
+             memory_used_gb=memory_used_gb,
+             memory_total_gb=memory_total_gb,
+             memory_percent=memory_percent,
+             gpu_util_percent=gpu_util_percent,
+             temperature_c=temperature_c,
+             power_watts=power_watts,
+             power_limit_watts=power_limit_watts,
+             tp_rank=tp_rank,
+         )
+
+     def _get_mock_stats(self) -> List[GPUStats]:
+         """Return mock stats when pynvml is not available."""
+         import random
+
+         return [
+             GPUStats(
+                 gpu_id=i,
+                 name=f"Mock GPU {i}",
+                 memory_used_gb=random.uniform(10, 20),
+                 memory_total_gb=24.0,
+                 memory_percent=random.uniform(40, 80),
+                 gpu_util_percent=random.uniform(20, 90),
+                 temperature_c=random.randint(40, 70),
+                 power_watts=random.uniform(100, 300),
+                 power_limit_watts=350,
+                 tp_rank=i,
+             )
+             for i in range(2)
+         ]
+
+     def shutdown(self) -> None:
+         """Clean up NVML resources."""
+         if self._initialized and PYNVML_AVAILABLE:
+             try:
+                 pynvml.nvmlShutdown()
+             except Exception:
+                 pass
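The memory math in `_collect_single_gpu` converts NVML's raw byte counts into the decimal-GB and percent figures shown in the GPU table. Isolated as a pure function for clarity (the helper name is for illustration only):

```python
def memory_figures(used_bytes: int, total_bytes: int) -> tuple:
    """Convert NVML byte counts to (used_gb, total_gb, percent), as the collector does."""
    used_gb = used_bytes / 1e9          # decimal gigabytes, matching the collector
    total_gb = total_bytes / 1e9
    percent = used_bytes / total_bytes * 100
    return round(used_gb, 2), round(total_gb, 2), round(percent, 1)


# e.g. 18 GB used of a 24 GB card
print(memory_figures(18_000_000_000, 24_000_000_000))  # → (18.0, 24.0, 75.0)
```

Note the collector divides by 1e9 (decimal GB), not 2**30 (GiB); the two differ by about 7%, which matters when comparing against `nvidia-smi`, which reports MiB.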
collectors/loading_tracker.py ADDED
@@ -0,0 +1,224 @@
+ """Model loading progress tracker."""
+
+ import json
+ import re
+ import time
+ import logging
+ from dataclasses import dataclass
+ from typing import Optional, List, Dict
+ from pathlib import Path
+ from enum import Enum
+
+ logger = logging.getLogger(__name__)
+
+
+ class LoadingStatus(Enum):
+     """Status of model loading."""
+     NOT_STARTED = "not_started"
+     DOWNLOADING = "downloading"
+     LOADING = "loading"
+     READY = "ready"
+     ERROR = "error"
+
+
+ @dataclass
+ class ShardInfo:
+     """Information about a model shard file."""
+     filename: str
+     size_mb: float
+     status: str  # pending, loading, loaded
+     layers: List[str]
+
+
+ @dataclass
+ class LoadingProgress:
+     """Overall loading progress."""
+     status: LoadingStatus
+     total_shards: int
+     loaded_shards: int
+     current_shard: Optional[str]
+     progress_percent: float
+     layers_loaded: int
+     total_layers: int
+     estimated_remaining_seconds: Optional[float]
+     error_message: Optional[str] = None
+
+
+ class LoadingTracker:
+     """Tracks model loading progress."""
+
+     def __init__(self, model_path: Optional[str] = None):
+         """
+         Initialize loading tracker.
+
+         Args:
+             model_path: Path to model directory
+         """
+         self.model_path = model_path
+         self._shards: List[ShardInfo] = []
+         self._status = LoadingStatus.NOT_STARTED
+         self._progress = 0.0
+         self._current_shard: Optional[str] = None
+         self._layers_loaded = 0
+         self._total_layers = 0
+         self._start_time: Optional[float] = None
+
+     def set_model_path(self, model_path: str) -> None:
+         """Set or update the model path."""
+         self.model_path = model_path
+         self._load_shard_info()
+
+     def _load_shard_info(self) -> None:
+         """Load shard information from safetensors index."""
+         if not self.model_path:
+             return
+
+         index_path = self._resolve_path("model.safetensors.index.json")
+         if not index_path:
+             return
+
+         try:
+             with open(index_path) as f:
+                 index = json.load(f)
+
+             weight_map = index.get("weight_map", {})
+
+             # Group weights by shard file
+             shard_weights: Dict[str, List[str]] = {}
+             for weight_name, shard_file in weight_map.items():
+                 shard_weights.setdefault(shard_file, []).append(weight_name)
+
+             # Create shard info
+             self._shards = []
+             for shard_file, weights in sorted(shard_weights.items()):
+                 shard_path = self._resolve_path(shard_file)
+                 size_mb = 0
+                 if shard_path and shard_path.exists():
+                     size_mb = shard_path.stat().st_size / (1024 * 1024)
+
+                 # Extract layer names
+                 layers = list(set(
+                     ".".join(w.split(".")[:3])
+                     for w in weights
+                     if len(w.split(".")) >= 3
+                 ))
+
+                 self._shards.append(ShardInfo(
+                     filename=shard_file,
+                     size_mb=size_mb,
+                     status="pending",
+                     layers=layers,
+                 ))
+
+             # Count total layers
+             all_layers = set()
+             for shard in self._shards:
+                 all_layers.update(shard.layers)
+             self._total_layers = len(all_layers)
+
+         except Exception as e:
+             logger.error(f"Error loading shard info: {e}")
+
+     def _resolve_path(self, filename: str) -> Optional[Path]:
+         """Resolve path to a file in the model directory."""
+         if not self.model_path:
+             return None
+
+         local_path = Path(self.model_path) / filename
+         if local_path.exists():
+             return local_path
+
+         return None
+
+     def update_from_log(self, log_line: str) -> None:
+         """
+         Update progress from a vLLM log line.
+
+         Args:
+             log_line: Log line from vLLM server
+         """
+         # Detect loading start
+         if "Loading model" in log_line:
+             self._status = LoadingStatus.LOADING
+             self._start_time = time.time()
+
+         # Detect shard loading
+         match = re.search(r"Loading safetensors: (\d+)/(\d+)", log_line)
+         if match:
+             loaded = int(match.group(1))
+             total = int(match.group(2))
+             self._progress = loaded / total * 100
+             for i, shard in enumerate(self._shards):
+                 if i < loaded:
+                     shard.status = "loaded"
+                 elif i == loaded:
+                     shard.status = "loading"
+                     self._current_shard = shard.filename
+
+         # Detect completion
+         if "Model loaded" in log_line or "Running with" in log_line:
+             self._status = LoadingStatus.READY
+             self._progress = 100.0
+             for shard in self._shards:
+                 shard.status = "loaded"
+
+         # Detect errors
+         if "Error" in log_line or "Exception" in log_line:
+             self._status = LoadingStatus.ERROR
+
+     def get_progress(self) -> LoadingProgress:
+         """
+         Get current loading progress.
+
+         Returns:
+             LoadingProgress with current state
+         """
+         loaded_shards = sum(1 for s in self._shards if s.status == "loaded")
+         total_shards = len(self._shards) if self._shards else 1
+
+         # Estimate remaining time
+         remaining = None
+         if self._start_time and self._progress > 0:
+             elapsed = time.time() - self._start_time
+             remaining = (elapsed / self._progress) * (100 - self._progress)
+
+         # Count loaded layers
+         loaded_layers = set()
+         for shard in self._shards:
+             if shard.status == "loaded":
+                 loaded_layers.update(shard.layers)
+
+         return LoadingProgress(
+             status=self._status,
+             total_shards=total_shards,
+             loaded_shards=loaded_shards,
+             current_shard=self._current_shard,
+             progress_percent=self._progress,
+             layers_loaded=len(loaded_layers),
+             total_layers=self._total_layers,
+             estimated_remaining_seconds=remaining,
+         )
+
+     def get_shards(self) -> List[ShardInfo]:
+         """Get list of all shards with their status."""
+         return self._shards
+
+     def set_ready(self) -> None:
+         """Mark the model as fully loaded."""
+         self._status = LoadingStatus.READY
+         self._progress = 100.0
+         for shard in self._shards:
+             shard.status = "loaded"
+
+     def reset(self) -> None:
+         """Reset progress tracker."""
+         self._status = LoadingStatus.NOT_STARTED
+         self._progress = 0.0
+         self._current_shard = None
+         self._layers_loaded = 0
+         self._start_time = None
+         for shard in self._shards:
+             shard.status = "pending"
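The shard-progress regex in `update_from_log` can be exercised standalone. A small sketch against sample log lines (the exact `Loading safetensors: N/M` format is the tracker's own assumption about vLLM's output; real log text may differ between vLLM versions):

```python
import re

SHARD_RE = re.compile(r"Loading safetensors: (\d+)/(\d+)")


def parse_progress(log_line: str):
    """Return (loaded, total, percent) if the line reports shard progress, else None."""
    match = SHARD_RE.search(log_line)
    if not match:
        return None
    loaded, total = int(match.group(1)), int(match.group(2))
    return loaded, total, loaded / total * 100


print(parse_progress("INFO Loading safetensors: 3/4"))   # → (3, 4, 75.0)
print(parse_progress("INFO Avg generation throughput"))  # → None
```

Using `re.search` (not `re.match`) lets the pattern sit anywhere in the line, so timestamps and log-level prefixes don't need stripping first.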
collectors/quant_collector.py ADDED
@@ -0,0 +1,259 @@
+ """Quantization information collector."""
+
+ import json
+ import os
+ import logging
+ from dataclasses import dataclass
+ from typing import Optional, Dict, Any, List
+ from pathlib import Path
+
+ logger = logging.getLogger(__name__)
+
+
+ @dataclass
+ class QuantizationInfo:
+     """Quantization details for a model."""
+     method: str  # GPTQ, AWQ, bitsandbytes, None
+     bits: int
+     group_size: Optional[int] = None
+     desc_act: Optional[bool] = None
+     sym: Optional[bool] = None
+     compute_dtype: Optional[str] = None
+     quant_type: Optional[str] = None  # For bitsandbytes: nf4, fp4
+     double_quant: Optional[bool] = None
+     raw_config: Optional[Dict[str, Any]] = None
+
+     def to_dict(self) -> Dict[str, Any]:
+         """Convert to dictionary for JSON display."""
+         result = {
+             "method": self.method,
+             "bits": self.bits,
+         }
+         if self.group_size is not None:
+             result["group_size"] = self.group_size
+         if self.desc_act is not None:
+             result["desc_act"] = self.desc_act
+         if self.sym is not None:
+             result["symmetric"] = self.sym
+         if self.compute_dtype:
+             result["compute_dtype"] = self.compute_dtype
+         if self.quant_type:
+             result["quant_type"] = self.quant_type
+         if self.double_quant is not None:
+             result["double_quant"] = self.double_quant
+         return result
+
+
+ @dataclass
+ class LayerPrecision:
+     """Precision information for a model layer."""
+     layer_name: str
+     bits: int
+     group_size: Optional[int]
+     dtype: str
+
+
+ class QuantizationCollector:
+     """Detects and collects quantization information from model configs."""
+
+     def __init__(self, model_path: Optional[str] = None):
+         """
+         Initialize quantization collector.
+
+         Args:
+             model_path: Path to model directory (local or HF model ID)
+         """
+         self.model_path = model_path
+         self._cached_info: Optional[QuantizationInfo] = None
+
+     def set_model_path(self, model_path: str) -> None:
+         """Set or update the model path."""
+         self.model_path = model_path
+         self._cached_info = None
+
+     def detect(self) -> QuantizationInfo:
+         """
+         Detect quantization method and settings.
+
+         Returns:
+             QuantizationInfo with detected settings
+         """
+         if self._cached_info is not None:
+             return self._cached_info
+
+         if not self.model_path:
+             return QuantizationInfo(method="Unknown", bits=16)
+
+         # Try to load config files
+         config = self._load_config()
+         quant_config = self._load_quant_config()
+
+         info = self._detect_quantization(config, quant_config)
+         self._cached_info = info
+         return info
+
+     def _load_config(self) -> Optional[Dict[str, Any]]:
+         """Load config.json from model path."""
+         config_path = self._resolve_path("config.json")
+         if config_path and config_path.exists():
+             try:
+                 with open(config_path) as f:
+                     return json.load(f)
+             except Exception as e:
+                 logger.error(f"Error loading config.json: {e}")
+         return None
+
+     def _load_quant_config(self) -> Optional[Dict[str, Any]]:
+         """Load quantize_config.json (GPTQ/AWQ) from model path."""
+         config_path = self._resolve_path("quantize_config.json")
+         if config_path and config_path.exists():
+             try:
+                 with open(config_path) as f:
+                     return json.load(f)
+             except Exception as e:
+                 logger.error(f"Error loading quantize_config.json: {e}")
+         return None
+
+     def _resolve_path(self, filename: str) -> Optional[Path]:
+         """Resolve path to a file in the model directory."""
+         if not self.model_path:
+             return None
+
+         # Handle local paths
+         local_path = Path(self.model_path) / filename
+         if local_path.exists():
+             return local_path
+
+         # Handle HuggingFace cache paths
+         cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
+         if cache_dir.exists():
+             # Search for model in cache by reconstructing the repo ID
+             # from the "models--org--name" directory naming scheme
+             for model_dir in cache_dir.glob("models--*"):
+                 model_name = model_dir.name.replace("models--", "").replace("--", "/")
+                 if model_name.lower() == self.model_path.lower():
+                     snapshot_path = model_dir / "snapshots"
+                     if snapshot_path.exists():
+                         # Get latest snapshot
+                         snapshots = list(snapshot_path.iterdir())
+                         if snapshots:
+                             file_path = snapshots[-1] / filename
+                             if file_path.exists():
+                                 return file_path
+
+         return None
+
+     def _detect_quantization(
+         self,
+         config: Optional[Dict[str, Any]],
+         quant_config: Optional[Dict[str, Any]],
+     ) -> QuantizationInfo:
+         """Detect quantization from config files."""
+
+         # Check for GPTQ via quantize_config.json
+         if quant_config:
+             if "bits" in quant_config:
+                 return QuantizationInfo(
+                     method="GPTQ",
+                     bits=quant_config.get("bits", 4),
+                     group_size=quant_config.get("group_size", 128),
+                     desc_act=quant_config.get("desc_act", False),
+                     sym=quant_config.get("sym", True),
+                     raw_config=quant_config,
+                 )
+
+         if not config:
+             return QuantizationInfo(method="Unknown", bits=16)
+
+         # Check for quantization_config in config.json
+         qc = config.get("quantization_config", {})
+
+         if qc:
+             quant_method = qc.get("quant_method", "").lower()
+
+             # AWQ
+             if quant_method == "awq":
175
+ return QuantizationInfo(
176
+ method="AWQ",
177
+ bits=qc.get("bits", 4),
178
+ group_size=qc.get("group_size", 128),
179
+ raw_config=qc,
180
+ )
181
+
182
+ # GPTQ (in config.json)
183
+ if quant_method == "gptq":
184
+ return QuantizationInfo(
185
+ method="GPTQ",
186
+ bits=qc.get("bits", 4),
187
+ group_size=qc.get("group_size", 128),
188
+ desc_act=qc.get("desc_act", False),
189
+ sym=qc.get("sym", True),
190
+ raw_config=qc,
191
+ )
192
+
193
+ # bitsandbytes
194
+ if qc.get("load_in_4bit") or qc.get("load_in_8bit"):
195
+ bits = 4 if qc.get("load_in_4bit") else 8
196
+ return QuantizationInfo(
197
+ method="bitsandbytes",
198
+ bits=bits,
199
+ compute_dtype=qc.get("bnb_4bit_compute_dtype", "float16"),
200
+ quant_type=qc.get("bnb_4bit_quant_type", "nf4"),
201
+ double_quant=qc.get("bnb_4bit_use_double_quant", False),
202
+ raw_config=qc,
203
+ )
204
+
205
+ # Check torch_dtype for fp16/bf16
206
+ torch_dtype = config.get("torch_dtype", "float16")
207
+ if torch_dtype in ("float16", "bfloat16"):
208
+ return QuantizationInfo(
209
+ method="None (FP16/BF16)",
210
+ bits=16,
211
+ compute_dtype=torch_dtype,
212
+ )
213
+
214
+ return QuantizationInfo(method="Unknown", bits=16)
215
+
216
+ def get_layer_precisions(self) -> List[LayerPrecision]:
217
+ """
218
+ Get per-layer precision information.
219
+
220
+ Returns:
221
+ List of LayerPrecision for each layer
222
+ """
223
+ info = self.detect()
224
+
225
+ # For quantized models, all layers typically have same precision
226
+ # This could be extended to parse safetensors index for more detail
227
+
228
+ index_path = self._resolve_path("model.safetensors.index.json")
229
+ if not index_path or not index_path.exists():
230
+ return []
231
+
232
+ try:
233
+ with open(index_path) as f:
234
+ index = json.load(f)
235
+
236
+ weight_map = index.get("weight_map", {})
237
+ layers = []
238
+ seen_layers = set()
239
+
240
+ for weight_name in weight_map.keys():
241
+ # Extract layer name
242
+ parts = weight_name.split(".")
243
+ if len(parts) >= 3:
244
+ layer_name = ".".join(parts[:3])
245
+ if layer_name not in seen_layers:
246
+ seen_layers.add(layer_name)
247
+ layers.append(
248
+ LayerPrecision(
249
+ layer_name=layer_name,
250
+ bits=info.bits,
251
+ group_size=info.group_size,
252
+ dtype=info.compute_dtype or "float16",
253
+ )
254
+ )
255
+
256
+ return layers
257
+ except Exception as e:
258
+ logger.error(f"Error parsing layer precisions: {e}")
259
+ return []
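The detection priority above (sidecar `quantize_config.json` first, then `quantization_config` in `config.json`, then `torch_dtype`) can be condensed into a standalone sketch. `detect_method` below is a hypothetical helper operating on plain dicts, not part of this repo:

```python
def detect_method(config, quant_config):
    """Return a short quantization label from parsed config dicts (sketch)."""
    # GPTQ ships a sidecar quantize_config.json with a "bits" key
    if quant_config and "bits" in quant_config:
        return f"GPTQ-{quant_config['bits']}bit"
    qc = (config or {}).get("quantization_config", {})
    method = qc.get("quant_method", "").lower()
    if method in ("awq", "gptq"):
        return f"{method.upper()}-{qc.get('bits', 4)}bit"
    if qc.get("load_in_4bit") or qc.get("load_in_8bit"):
        return "bitsandbytes-4bit" if qc.get("load_in_4bit") else "bitsandbytes-8bit"
    # No quantization config: fall back to the checkpoint dtype
    dtype = (config or {}).get("torch_dtype", "float16")
    return "None (FP16/BF16)" if dtype in ("float16", "bfloat16") else "Unknown"

print(detect_method({"quantization_config": {"quant_method": "awq", "bits": 4}}, None))
# → AWQ-4bit
```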
collectors/vllm_collector.py ADDED
@@ -0,0 +1,226 @@
+ """vLLM metrics collector via Prometheus endpoint."""
2
+
3
+ import requests
4
+ import logging
5
+ from dataclasses import dataclass, field
6
+ from typing import Optional, Dict, List, Any
7
+ from datetime import datetime
8
+
9
+ from utils.prometheus_parser import (
10
+ parse_prometheus_metrics,
11
+ get_metric_value,
12
+ get_histogram_quantile,
13
+ MetricSample,
14
+ )
15
+
16
+ logger = logging.getLogger(__name__)
17
+
18
+
19
+ @dataclass
20
+ class InferenceMetrics:
21
+ """Inference metrics from vLLM."""
22
+ timestamp: datetime = field(default_factory=datetime.now)
23
+
24
+ # Request counts
25
+ num_requests_running: int = 0
26
+ num_requests_waiting: int = 0
27
+ num_requests_swapped: int = 0
28
+
29
+ # Token throughput
30
+ prompt_tokens_total: int = 0
31
+ generation_tokens_total: int = 0
32
+ tokens_per_second: float = 0.0
33
+
34
+ # Latency
35
+ ttft_ms: float = 0.0 # Time to first token
36
+ tpot_ms: float = 0.0 # Time per output token
37
+ e2e_latency_ms: float = 0.0 # End-to-end latency
38
+
39
+ # Cache
40
+ kv_cache_usage_percent: float = 0.0
41
+ gpu_cache_usage_percent: float = 0.0
42
+ cpu_cache_usage_percent: float = 0.0
43
+
44
+ # Model info
45
+ model_name: str = ""
46
+ max_model_len: int = 0
47
+
48
+ # Derived
49
+ prefill_ratio: float = 0.0
50
+ batch_size: int = 0
51
+
52
+
53
+ class VLLMCollector:
54
+ """Collects metrics from vLLM Prometheus endpoint."""
55
+
56
+ def __init__(self, metrics_url: str = "http://localhost:8000/metrics"):
57
+ """
58
+ Initialize the vLLM collector.
59
+
60
+ Args:
61
+ metrics_url: URL to vLLM's /metrics endpoint
62
+ """
63
+ self.metrics_url = metrics_url
64
+ self._last_prompt_tokens = 0
65
+ self._last_generation_tokens = 0
66
+ self._last_collect_time: Optional[datetime] = None
67
+ self._connected = False
68
+
69
+ def check_connection(self) -> bool:
70
+ """Check if vLLM server is accessible."""
71
+ try:
72
+ response = requests.get(self.metrics_url, timeout=2)
73
+ self._connected = response.status_code == 200
74
+ return self._connected
75
+ except Exception:
76
+ self._connected = False
77
+ return False
78
+
79
+ @property
80
+ def is_connected(self) -> bool:
81
+ """Return connection status."""
82
+ return self._connected
83
+
84
+ def collect(self) -> InferenceMetrics:
85
+ """
86
+ Collect all inference metrics from vLLM.
87
+
88
+ Returns:
89
+ InferenceMetrics dataclass with current values
90
+ """
91
+ metrics = InferenceMetrics()
92
+
93
+ try:
94
+ response = requests.get(self.metrics_url, timeout=5)
95
+ response.raise_for_status()
96
+ self._connected = True
97
+
98
+ raw_metrics = parse_prometheus_metrics(response.text)
99
+ metrics = self._parse_metrics(raw_metrics)
100
+
101
+ except requests.exceptions.ConnectionError:
102
+ self._connected = False
103
+ logger.debug("Cannot connect to vLLM metrics endpoint")
104
+ except Exception as e:
105
+ logger.error(f"Error collecting vLLM metrics: {e}")
106
+
107
+ return metrics
108
+
109
+ def _parse_metrics(self, raw: Dict[str, List[MetricSample]]) -> InferenceMetrics:
110
+ """Parse raw Prometheus metrics into InferenceMetrics."""
111
+ now = datetime.now()
112
+ metrics = InferenceMetrics(timestamp=now)
113
+
114
+ # Request counts
115
+ metrics.num_requests_running = int(
116
+ get_metric_value(raw, "vllm:num_requests_running") or 0
117
+ )
118
+ metrics.num_requests_waiting = int(
119
+ get_metric_value(raw, "vllm:num_requests_waiting") or 0
120
+ )
121
+ metrics.num_requests_swapped = int(
122
+ get_metric_value(raw, "vllm:num_requests_swapped") or 0
123
+ )
124
+ metrics.batch_size = metrics.num_requests_running
125
+
126
+ # Token counts (counters)
127
+ prompt_tokens = int(get_metric_value(raw, "vllm:prompt_tokens_total") or 0)
128
+ generation_tokens = int(
129
+ get_metric_value(raw, "vllm:generation_tokens_total") or 0
130
+ )
131
+
132
+ # Calculate tokens per second
133
+ if self._last_collect_time:
134
+ time_delta = (now - self._last_collect_time).total_seconds()
135
+ if time_delta > 0:
136
+ token_delta = generation_tokens - self._last_generation_tokens
137
+ metrics.tokens_per_second = token_delta / time_delta
138
+
139
+ self._last_prompt_tokens = prompt_tokens
140
+ self._last_generation_tokens = generation_tokens
141
+ self._last_collect_time = now
142
+
143
+ metrics.prompt_tokens_total = prompt_tokens
144
+ metrics.generation_tokens_total = generation_tokens
145
+
146
+ # Calculate prefill ratio
147
+ total_tokens = prompt_tokens + generation_tokens
148
+ if total_tokens > 0:
149
+ metrics.prefill_ratio = prompt_tokens / total_tokens
150
+
151
+ # Latency metrics (from histograms, use P50)
152
+ ttft = get_histogram_quantile(raw, "vllm:time_to_first_token_seconds", 0.5)
153
+ if ttft is not None:
154
+ metrics.ttft_ms = ttft * 1000
155
+
156
+ tpot = get_histogram_quantile(raw, "vllm:time_per_output_token_seconds", 0.5)
157
+ if tpot is not None:
158
+ metrics.tpot_ms = tpot * 1000
159
+
160
+ e2e = get_histogram_quantile(raw, "vllm:e2e_request_latency_seconds", 0.5)
161
+ if e2e is not None:
162
+ metrics.e2e_latency_ms = e2e * 1000
163
+
164
+ # Cache usage
165
+ metrics.gpu_cache_usage_percent = (
166
+ get_metric_value(raw, "vllm:gpu_cache_usage_perc") or 0
167
+ ) * 100
168
+ metrics.cpu_cache_usage_percent = (
169
+ get_metric_value(raw, "vllm:cpu_cache_usage_perc") or 0
170
+ ) * 100
171
+ metrics.kv_cache_usage_percent = metrics.gpu_cache_usage_percent
172
+
173
+ # Model info
174
+ model_name = self._get_model_name(raw)
175
+ if model_name:
176
+ metrics.model_name = model_name
177
+
178
+ return metrics
179
+
180
+ def _get_model_name(self, raw: Dict[str, List[MetricSample]]) -> Optional[str]:
181
+ """Extract model name from metrics labels."""
182
+ # Look for model name in any metric with model_name label
183
+ for metric_name, samples in raw.items():
184
+ for sample in samples:
185
+ if "model_name" in sample.labels:
186
+ return sample.labels["model_name"]
187
+ return None
188
+
189
+ def get_rank_mapping(self) -> Dict[int, int]:
190
+ """
191
+ Get tensor parallel rank to GPU mapping.
192
+
193
+ Returns:
194
+ Dictionary mapping TP rank to GPU ID
195
+ """
196
+ # This would typically come from vLLM's internal state
197
+ # For now, return empty mapping - can be extended
198
+ return {}
199
+
200
+ def get_latency_percentiles(self) -> Dict[str, Dict[str, float]]:
201
+ """
202
+ Get latency percentiles for detailed analysis.
203
+
204
+ Returns:
205
+ Dictionary with P50, P95, P99 for each latency metric
206
+ """
207
+ try:
208
+ response = requests.get(self.metrics_url, timeout=5)
209
+ raw = parse_prometheus_metrics(response.text)
210
+
211
+ result = {}
212
+ for metric_base in [
213
+ "vllm:time_to_first_token_seconds",
214
+ "vllm:time_per_output_token_seconds",
215
+ "vllm:e2e_request_latency_seconds",
216
+ ]:
217
+ result[metric_base] = {
218
+ "p50": (get_histogram_quantile(raw, metric_base, 0.5) or 0) * 1000,
219
+ "p95": (get_histogram_quantile(raw, metric_base, 0.95) or 0) * 1000,
220
+ "p99": (get_histogram_quantile(raw, metric_base, 0.99) or 0) * 1000,
221
+ }
222
+
223
+ return result
224
+ except Exception as e:
225
+ logger.error(f"Error getting latency percentiles: {e}")
226
+ return {}
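The throughput calculation in `_parse_metrics` is a standard rate-from-counter pattern: vLLM exposes cumulative token counters, so the dashboard derives tokens/sec from the delta between two scrapes. A minimal standalone sketch of that arithmetic (hypothetical helper, not part of the repo):

```python
from datetime import datetime, timedelta

def tokens_per_second(prev_total, prev_time, curr_total, curr_time):
    """Rate from two cumulative-counter snapshots.

    A negative delta means the counter was reset (server restart),
    so report 0.0 rather than a bogus negative rate.
    """
    dt = (curr_time - prev_time).total_seconds()
    if dt <= 0:
        return 0.0
    return max(0, curr_total - prev_total) / dt

t0 = datetime(2024, 1, 1, 12, 0, 0)
print(tokens_per_second(10_000, t0, 12_500, t0 + timedelta(seconds=5)))  # → 500.0
```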
components/__init__.py ADDED
@@ -0,0 +1,27 @@
+ """UI components for the Gradio dashboard."""
2
+
3
+ from .gpu_panel import create_gpu_panel, update_gpu_panel
4
+ from .inference_panel import create_inference_panel, update_inference_panel
5
+ from .quant_panel import create_quant_panel, update_quant_panel
6
+ from .loading_panel import create_loading_panel, update_loading_panel
7
+ from .alerts_panel import create_alerts_panel, update_alerts_panel
8
+ from .tracing_panel import create_tracing_panel, update_tracing_panel
9
+ from .comparison_panel import create_comparison_panel
10
+ from .loadtest_panel import create_loadtest_panel
11
+
12
+ __all__ = [
13
+ "create_gpu_panel",
14
+ "update_gpu_panel",
15
+ "create_inference_panel",
16
+ "update_inference_panel",
17
+ "create_quant_panel",
18
+ "update_quant_panel",
19
+ "create_loading_panel",
20
+ "update_loading_panel",
21
+ "create_alerts_panel",
22
+ "update_alerts_panel",
23
+ "create_tracing_panel",
24
+ "update_tracing_panel",
25
+ "create_comparison_panel",
26
+ "create_loadtest_panel",
27
+ ]
components/alerts_panel.py ADDED
@@ -0,0 +1,253 @@
+ """Alerts configuration and history panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ from datetime import datetime
6
+ from typing import Dict, Any, Tuple, List
7
+
8
+ from services.alerting import AlertEngine, AlertDispatcher, Alert, AlertSeverity
9
+
10
+
11
+ def create_alerts_panel(
12
+ alert_engine: AlertEngine,
13
+ alert_dispatcher: AlertDispatcher,
14
+ ) -> Dict[str, Any]:
15
+ """
16
+ Create the alerts panel.
17
+
18
+ Args:
19
+ alert_engine: Alert engine instance
20
+ alert_dispatcher: Alert dispatcher instance
21
+
22
+ Returns:
23
+ Dictionary of Gradio components
24
+ """
25
+ with gr.Column():
26
+ with gr.Row():
27
+ # Active alerts column
28
+ with gr.Column(scale=2):
29
+ gr.Markdown("### Active Alerts")
30
+ active_alerts_table = gr.Dataframe(
31
+ headers=["Time", "Severity", "Metric", "Value", "Threshold", "Message"],
32
+ datatype=["str", "str", "str", "number", "number", "str"],
33
+ label="Active Alerts",
34
+ interactive=False,
35
+ )
36
+
37
+ gr.Markdown("### Alert History")
38
+ alert_history_table = gr.Dataframe(
39
+ headers=["Time", "Severity", "Message", "Resolved"],
40
+ datatype=["str", "str", "str", "str"],
41
+ label="Recent Alerts",
42
+ interactive=False,
43
+ )
44
+
45
+ # Configuration column
46
+ with gr.Column(scale=1):
47
+ gr.Markdown("### Alert Configuration")
48
+
49
+ kv_threshold = gr.Slider(
50
+ label="KV Cache Alert Threshold (%)",
51
+ minimum=50,
52
+ maximum=100,
53
+ value=90,
54
+ step=5,
55
+ )
56
+
57
+ gpu_memory_threshold = gr.Slider(
58
+ label="GPU Memory Alert Threshold (%)",
59
+ minimum=70,
60
+ maximum=100,
61
+ value=95,
62
+ step=5,
63
+ )
64
+
65
+ ttft_multiplier = gr.Slider(
66
+ label="TTFT Spike Multiplier",
67
+ minimum=1.5,
68
+ maximum=5,
69
+ value=2,
70
+ step=0.5,
71
+ )
72
+
73
+ throughput_drop = gr.Slider(
74
+ label="Throughput Drop Alert (%)",
75
+ minimum=20,
76
+ maximum=80,
77
+ value=50,
78
+ step=10,
79
+ )
80
+
81
+ gr.Markdown("### Webhook Configuration")
82
+
83
+ slack_webhook = gr.Textbox(
84
+ label="Slack Webhook URL",
85
+ placeholder="https://hooks.slack.com/services/...",
86
+ type="password",
87
+ )
88
+
89
+ pagerduty_key = gr.Textbox(
90
+ label="PagerDuty Routing Key",
91
+ placeholder="Enter routing key...",
92
+ type="password",
93
+ )
94
+
95
+ with gr.Row():
96
+ save_config_btn = gr.Button("Save Configuration")
97
+ test_alert_btn = gr.Button("Send Test Alert", variant="secondary")
98
+
99
+ config_status = gr.Textbox(
100
+ label="Status",
101
+ interactive=False,
102
+ visible=True,
103
+ )
104
+
105
+ # Event handlers
106
+ def save_config(kv, gpu, ttft, tp_drop, slack, pd_key):
107
+ # Update alert thresholds
108
+ if "kv_cache_high" in alert_engine.rules:
109
+ alert_engine.rules["kv_cache_high"].threshold = kv
110
+ if "gpu_memory_critical" in alert_engine.rules:
111
+ alert_engine.rules["gpu_memory_critical"].threshold = gpu
112
+ if "ttft_spike" in alert_engine.rules:
113
+ alert_engine.rules["ttft_spike"].multiplier = ttft
114
+ if "throughput_drop" in alert_engine.rules:
115
+ alert_engine.rules["throughput_drop"].percent = tp_drop
116
+
117
+ # Update webhook config
118
+ alert_dispatcher.slack_webhook = slack if slack else None
119
+ alert_dispatcher.pagerduty_key = pd_key if pd_key else None
120
+
121
+ return "Configuration saved successfully"
122
+
123
+ save_config_btn.click(
124
+ fn=save_config,
125
+ inputs=[
126
+ kv_threshold,
127
+ gpu_memory_threshold,
128
+ ttft_multiplier,
129
+ throughput_drop,
130
+ slack_webhook,
131
+ pagerduty_key,
132
+ ],
133
+ outputs=config_status,
134
+ )
135
+
136
+ async def send_test():
137
+ success = await alert_dispatcher.send_test_alert()
138
+ if success:
139
+ return "Test alert sent successfully"
140
+ return "Failed to send test alert - check webhook configuration"
141
+
142
+ test_alert_btn.click(
143
+ fn=send_test,
144
+ outputs=config_status,
145
+ )
146
+
147
+ return {
148
+ "active_alerts_table": active_alerts_table,
149
+ "alert_history_table": alert_history_table,
150
+ "kv_threshold": kv_threshold,
151
+ "gpu_memory_threshold": gpu_memory_threshold,
152
+ "ttft_multiplier": ttft_multiplier,
153
+ "throughput_drop": throughput_drop,
154
+ "slack_webhook": slack_webhook,
155
+ "pagerduty_key": pagerduty_key,
156
+ "config_status": config_status,
157
+ }
158
+
159
+
160
+ def update_alerts_panel(
161
+ alert_engine: AlertEngine,
162
+ db=None,
163
+ ) -> Tuple[pd.DataFrame, pd.DataFrame]:
164
+ """
165
+ Update the alerts panel with current data.
166
+
167
+ Args:
168
+ alert_engine: Alert engine instance
169
+ db: Optional database for history
170
+
171
+ Returns:
172
+ Tuple of (active_alerts_df, history_df)
173
+ """
174
+ # Get active alerts
175
+ active = alert_engine.get_active_alerts()
176
+ active_rows = []
177
+ for alert in active:
178
+ active_rows.append({
179
+ "Time": alert.timestamp.strftime("%H:%M:%S"),
180
+ "Severity": _format_severity(alert.severity),
181
+ "Metric": alert.metric,
182
+ "Value": round(alert.value, 2),
183
+ "Threshold": round(alert.threshold, 2),
184
+ "Message": alert.message,
185
+ })
186
+
187
+ active_df = pd.DataFrame(active_rows) if active_rows else pd.DataFrame(
188
+ columns=["Time", "Severity", "Metric", "Value", "Threshold", "Message"]
189
+ )
190
+
191
+ # Get history from database
192
+ history_rows = []
193
+ if db:
194
+ recent = db.get_recent_alerts(limit=20)
195
+ for record in recent:
196
+ history_rows.append({
197
+ "Time": record.timestamp.strftime("%Y-%m-%d %H:%M:%S"),
198
+ "Severity": _format_severity_str(record.severity),
199
+ "Message": record.message,
200
+ "Resolved": "Yes" if record.resolved_at else "No",
201
+ })
202
+
203
+ history_df = pd.DataFrame(history_rows) if history_rows else pd.DataFrame(
204
+ columns=["Time", "Severity", "Message", "Resolved"]
205
+ )
206
+
207
+ return active_df, history_df
208
+
209
+
210
+ def _format_severity(severity: AlertSeverity) -> str:
211
+ """Format severity for display."""
212
+ icons = {
213
+ AlertSeverity.INFO: "INFO",
214
+ AlertSeverity.WARNING: "WARNING",
215
+ AlertSeverity.CRITICAL: "CRITICAL",
216
+ }
217
+ return icons.get(severity, "UNKNOWN")
218
+
219
+
220
+ def _format_severity_str(severity: str) -> str:
221
+ """Format severity string for display."""
222
+ return severity.upper()
223
+
224
+
225
+ def get_alert_badge_html(alerts: List[Alert]) -> str:
226
+ """
227
+ Generate HTML badge for active alerts.
228
+
229
+ Args:
230
+ alerts: List of active alerts
231
+
232
+ Returns:
233
+ HTML string for badge
234
+ """
235
+ if not alerts:
236
+ return '<span style="color: #2e7d32;">No Active Alerts</span>'
237
+
238
+ critical = sum(1 for a in alerts if a.severity == AlertSeverity.CRITICAL)
239
+ warning = sum(1 for a in alerts if a.severity == AlertSeverity.WARNING)
240
+
241
+ badges = []
242
+ if critical > 0:
243
+ badges.append(
244
+ f'<span style="background: #c62828; color: white; padding: 2px 8px; '
245
+ f'border-radius: 12px; margin-right: 5px;">{critical} Critical</span>'
246
+ )
247
+ if warning > 0:
248
+ badges.append(
249
+ f'<span style="background: #ff9800; color: white; padding: 2px 8px; '
250
+ f'border-radius: 12px;">{warning} Warning</span>'
251
+ )
252
+
253
+ return "".join(badges)
components/comparison_panel.py ADDED
@@ -0,0 +1,207 @@
+ """A/B comparison panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ import asyncio
6
+ from typing import Dict, Any, Tuple
7
+
8
+ from services.comparator import ABComparator, DeploymentConfig, ComparisonResult
9
+
10
+
11
+ def create_comparison_panel() -> Dict[str, Any]:
12
+ """
13
+ Create the A/B comparison panel.
14
+
15
+ Returns:
16
+ Dictionary of Gradio components
17
+ """
18
+ with gr.Column():
19
+ gr.Markdown("### A/B Deployment Comparison")
20
+
21
+ # Endpoint configuration
22
+ with gr.Row():
23
+ endpoint_a = gr.Textbox(
24
+ label="Deployment A",
25
+ value="http://localhost:8000",
26
+ placeholder="http://host:port",
27
+ )
28
+ name_a = gr.Textbox(
29
+ label="Name A",
30
+ value="Baseline",
31
+ placeholder="e.g., FP16-baseline",
32
+ )
33
+
34
+ with gr.Row():
35
+ endpoint_b = gr.Textbox(
36
+ label="Deployment B",
37
+ value="http://localhost:8001",
38
+ placeholder="http://host:port",
39
+ )
40
+ name_b = gr.Textbox(
41
+ label="Name B",
42
+ value="Candidate",
43
+ placeholder="e.g., AWQ-4bit",
44
+ )
45
+
46
+ with gr.Row():
47
+ compare_btn = gr.Button("Compare Now", variant="primary")
48
+ collect_samples_btn = gr.Button("Collect Samples (30s)", variant="secondary")
49
+
50
+ # Status
51
+ comparison_status = gr.Textbox(
52
+ label="Status",
53
+ interactive=False,
54
+ )
55
+
56
+ # Results side by side
57
+ with gr.Row():
58
+ with gr.Column():
59
+ gr.Markdown("### Deployment A")
60
+ a_connected = gr.Checkbox(label="Connected", interactive=False)
61
+ a_throughput = gr.Number(label="Throughput (tok/s)", precision=1, interactive=False)
62
+ a_ttft = gr.Number(label="TTFT (ms)", precision=1, interactive=False)
63
+ a_latency = gr.Number(label="E2E Latency (ms)", precision=1, interactive=False)
64
+ a_kv_cache = gr.Number(label="KV Cache %", precision=1, interactive=False)
65
+ a_batch = gr.Number(label="Batch Size", precision=0, interactive=False)
66
+
67
+ with gr.Column():
68
+ gr.Markdown("### Deployment B")
69
+ b_connected = gr.Checkbox(label="Connected", interactive=False)
70
+ b_throughput = gr.Number(label="Throughput (tok/s)", precision=1, interactive=False)
71
+ b_ttft = gr.Number(label="TTFT (ms)", precision=1, interactive=False)
72
+ b_latency = gr.Number(label="E2E Latency (ms)", precision=1, interactive=False)
73
+ b_kv_cache = gr.Number(label="KV Cache %", precision=1, interactive=False)
74
+ b_batch = gr.Number(label="Batch Size", precision=0, interactive=False)
75
+
76
+ # Comparison table
77
+ gr.Markdown("### Comparison Summary")
78
+ comparison_table = gr.Dataframe(
79
+ headers=["Metric", "Deployment A", "Deployment B", "Difference"],
80
+ label="Comparison",
81
+ interactive=False,
82
+ )
83
+
84
+ # Recommendation
85
+ recommendation = gr.Markdown("")
86
+
87
+ # Statistical significance
88
+ with gr.Row():
89
+ significance_throughput = gr.Textbox(
90
+ label="Throughput Significance",
91
+ interactive=False,
92
+ )
93
+ significance_latency = gr.Textbox(
94
+ label="Latency Significance",
95
+ interactive=False,
96
+ )
97
+
98
+ # Event handlers
99
+ async def run_comparison(ep_a, name_a_val, ep_b, name_b_val):
100
+ config_a = DeploymentConfig(name=name_a_val, endpoint=ep_a)
101
+ config_b = DeploymentConfig(name=name_b_val, endpoint=ep_b)
102
+
103
+ comparator = ABComparator(config_a, config_b)
104
+ result = await comparator.compare()
105
+
106
+ return format_comparison_results(result, comparator)
107
+
108
+ compare_btn.click(
109
+ fn=run_comparison,
110
+ inputs=[endpoint_a, name_a, endpoint_b, name_b],
111
+ outputs=[
112
+ comparison_status,
113
+ a_connected, a_throughput, a_ttft, a_latency, a_kv_cache, a_batch,
114
+ b_connected, b_throughput, b_ttft, b_latency, b_kv_cache, b_batch,
115
+ comparison_table, recommendation,
116
+ significance_throughput, significance_latency,
117
+ ],
118
+ )
119
+
120
+ async def collect_and_compare(ep_a, name_a_val, ep_b, name_b_val):
121
+ config_a = DeploymentConfig(name=name_a_val, endpoint=ep_a)
122
+ config_b = DeploymentConfig(name=name_b_val, endpoint=ep_b)
123
+
124
+ comparator = ABComparator(config_a, config_b)
125
+
126
+ # Collect samples (this takes ~30 seconds)
127
+ yield (
128
+ "Collecting samples (0/30)...",
129
+ *[None] * 15 # Placeholder outputs
130
+ )
131
+
132
+ await comparator.collect_samples(count=30)
133
+ result = await comparator.compare()
134
+
135
+ yield format_comparison_results(result, comparator)
136
+
137
+ collect_samples_btn.click(
138
+ fn=collect_and_compare,
139
+ inputs=[endpoint_a, name_a, endpoint_b, name_b],
140
+ outputs=[
141
+ comparison_status,
142
+ a_connected, a_throughput, a_ttft, a_latency, a_kv_cache, a_batch,
143
+ b_connected, b_throughput, b_ttft, b_latency, b_kv_cache, b_batch,
144
+ comparison_table, recommendation,
145
+ significance_throughput, significance_latency,
146
+ ],
147
+ )
148
+
149
+ return {
150
+ "endpoint_a": endpoint_a,
151
+ "name_a": name_a,
152
+ "endpoint_b": endpoint_b,
153
+ "name_b": name_b,
154
+ "comparison_status": comparison_status,
155
+ "a_connected": a_connected,
156
+ "a_throughput": a_throughput,
157
+ "a_ttft": a_ttft,
158
+ "a_latency": a_latency,
159
+ "a_kv_cache": a_kv_cache,
160
+ "a_batch": a_batch,
161
+ "b_connected": b_connected,
162
+ "b_throughput": b_throughput,
163
+ "b_ttft": b_ttft,
164
+ "b_latency": b_latency,
165
+ "b_kv_cache": b_kv_cache,
166
+ "b_batch": b_batch,
167
+ "comparison_table": comparison_table,
168
+ "recommendation": recommendation,
169
+ "significance_throughput": significance_throughput,
170
+ "significance_latency": significance_latency,
171
+ }
172
+
173
+
174
+ def format_comparison_results(
175
+ result: ComparisonResult,
176
+ comparator: ABComparator,
177
+ ) -> Tuple:
178
+ """Format comparison results for UI components."""
179
+ a = result.deployment_a
180
+ b = result.deployment_b
181
+
182
+ # Build comparison table
183
+ table_data = comparator.get_comparison_table(result)
184
+ table_df = pd.DataFrame(table_data)
185
+
186
+ # Format recommendation
187
+ recommendation_md = f"**Recommendation:** {result.recommendation}"
188
+
189
+ # Format significance
190
+ sig_throughput = "Not tested"
191
+ sig_latency = "Not tested"
192
+
193
+ if result.p_value_throughput < 1.0:
194
+ sig_status = "Significant" if result.throughput_significant else "Not significant"
195
+ sig_throughput = f"{sig_status} (p={result.p_value_throughput:.4f})"
196
+
197
+ if result.p_value_latency < 1.0:
198
+ sig_status = "Significant" if result.latency_significant else "Not significant"
199
+ sig_latency = f"{sig_status} (p={result.p_value_latency:.4f})"
200
+
201
+ return (
202
+ "Comparison complete",
203
+ a.connected, a.tokens_per_second, a.ttft_ms, a.e2e_latency_ms, a.kv_cache_percent, a.batch_size,
204
+ b.connected, b.tokens_per_second, b.ttft_ms, b.e2e_latency_ms, b.kv_cache_percent, b.batch_size,
205
+ table_df, recommendation_md,
206
+ sig_throughput, sig_latency,
207
+ )
components/gpu_panel.py ADDED
@@ -0,0 +1,191 @@
 
+ """GPU status panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ from typing import List, Dict, Any, Tuple
6
+
7
+ from collectors.gpu_collector import GPUCollector, GPUStats
8
+ from utils.history import MetricHistory
9
+
10
+
11
+ def create_gpu_panel(history: MetricHistory) -> Dict[str, Any]:
12
+ """
13
+ Create the GPU status panel.
14
+
15
+ Args:
16
+ history: Metric history for charting
17
+
18
+ Returns:
19
+ Dictionary of Gradio components
20
+ """
21
+ with gr.Column():
22
+ gr.Markdown("### GPU / Rank Status")
23
+
24
+ # GPU stats table
25
+ gpu_table = gr.Dataframe(
26
+ headers=["GPU", "Name", "Memory", "Memory %", "Util %", "Temp", "Power", "TP Rank"],
27
+ datatype=["number", "str", "str", "number", "number", "str", "str", "str"],
28
+ label="GPU Statistics",
29
+ interactive=False,
30
+ )
31
+
32
+ with gr.Row():
33
+ # Memory usage plot
34
+ gpu_memory_plot = gr.LinePlot(
35
+ x="time",
36
+ y="value",
37
+ color="gpu",
38
+ title="GPU Memory Usage (GB)",
39
+ x_title="Time",
40
+ y_title="Memory (GB)",
41
+ height=250,
42
+ )
43
+
44
+ # Utilization plot
45
+ gpu_util_plot = gr.LinePlot(
46
+ x="time",
47
+ y="value",
48
+ color="gpu",
49
+ title="GPU Utilization (%)",
50
+ x_title="Time",
51
+ y_title="Utilization %",
52
+ height=250,
53
+ )
54
+
55
+ # NCCL / Communication status
56
+ nccl_status = gr.HTML(
57
+ value='<div style="padding: 10px; background: #e8f5e9; border-radius: 5px;">'
58
+ '<span style="color: #2e7d32;">NCCL Status: Healthy</span></div>',
59
+ label="Communication Status",
60
+ )
61
+
62
+ return {
63
+ "gpu_table": gpu_table,
64
+ "gpu_memory_plot": gpu_memory_plot,
65
+ "gpu_util_plot": gpu_util_plot,
66
+ "nccl_status": nccl_status,
67
+ }
68
+
69
+
70
+ def update_gpu_panel(
71
+ collector: GPUCollector,
72
+ history: MetricHistory,
73
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
74
+ """
75
+ Update the GPU panel with current data.
76
+
77
+ Args:
78
+ collector: GPU collector instance
79
+ history: Metric history
80
+
81
+ Returns:
82
+ Tuple of (table_data, memory_plot_data, util_plot_data, nccl_html)
83
+ """
84
+ stats = collector.collect()
85
+
86
+ # Update history
87
+ for stat in stats:
88
+ history.add(
89
+ "gpu_memory_gb",
90
+ stat.memory_used_gb,
91
+ labels={"gpu": str(stat.gpu_id)},
92
+ )
93
+ history.add(
94
+ "gpu_util_percent",
95
+ stat.gpu_util_percent,
96
+ labels={"gpu": str(stat.gpu_id)},
97
+ )
98
+
99
+ # Build table data
100
+ table_data = _build_table(stats)
101
+
102
+ # Build chart data
103
+ memory_df = _build_memory_chart_data(history)
104
+ util_df = _build_util_chart_data(history)
105
+
106
+ # NCCL status (simplified - would need more complex detection)
107
+ nccl_html = _build_nccl_status(stats)
108
+
109
+ return table_data, memory_df, util_df, nccl_html
110
+
111
+
112
+ def _build_table(stats: List[GPUStats]) -> pd.DataFrame:
113
+ """Build GPU stats table."""
114
+ rows = []
115
+ for stat in stats:
116
+ rows.append({
117
+ "GPU": stat.gpu_id,
118
+ "Name": stat.name[:20] if len(stat.name) > 20 else stat.name,
119
+ "Memory": f"{stat.memory_used_gb:.1f}/{stat.memory_total_gb:.1f} GB",
120
+ "Memory %": round(stat.memory_percent, 1),
121
+ "Util %": round(stat.gpu_util_percent, 1),
122
+ "Temp": f"{stat.temperature_c}C",
123
+ "Power": f"{stat.power_watts:.0f}/{stat.power_limit_watts:.0f}W",
124
+ "TP Rank": str(stat.tp_rank) if stat.tp_rank is not None else "-",
125
+ })
126
+
127
+ return pd.DataFrame(rows)
128
+
129
+
130
+ def _build_memory_chart_data(history: MetricHistory) -> pd.DataFrame:
131
+ """Build memory usage chart data."""
132
+ all_series = history.get_all_series("gpu_memory_gb")
133
+
134
+ rows = []
135
+ for key, points in all_series.items():
136
+ gpu_id = key.split("=")[-1] if "=" in key else "0"
137
+ for point in points[-60:]: # Last 60 points
138
+ rows.append({
139
+ "time": point.timestamp,
140
+ "value": point.value,
141
+ "gpu": f"GPU {gpu_id}",
142
+ })
143
+
144
+ if not rows:
145
+ return pd.DataFrame({"time": [], "value": [], "gpu": []})
146
+
147
+ return pd.DataFrame(rows)
148
+
149
+
150
+ def _build_util_chart_data(history: MetricHistory) -> pd.DataFrame:
151
+ """Build utilization chart data."""
152
+ all_series = history.get_all_series("gpu_util_percent")
153
+
154
+ rows = []
155
+ for key, points in all_series.items():
156
+ gpu_id = key.split("=")[-1] if "=" in key else "0"
157
+ for point in points[-60:]:
158
+ rows.append({
159
+ "time": point.timestamp,
160
+ "value": point.value,
161
+ "gpu": f"GPU {gpu_id}",
162
+ })
163
+
164
+ if not rows:
165
+ return pd.DataFrame({"time": [], "value": [], "gpu": []})
166
+
167
+ return pd.DataFrame(rows)
168
+
169
+
170
+ def _build_nccl_status(stats: List[GPUStats]) -> str:
171
+ """Build NCCL status HTML."""
172
+ if not stats:
173
+ return (
174
+ '<div style="padding: 10px; background: #fff3e0; border-radius: 5px;">'
175
+ '<span style="color: #e65100;">NCCL Status: No GPUs detected</span></div>'
176
+ )
177
+
178
+ # Check for GPU communication health indicators
179
+ # In a real implementation, this would check vLLM metrics for NCCL errors
180
+ all_healthy = all(stat.gpu_util_percent > 0 or stat.memory_percent > 0 for stat in stats)
181
+
182
+ if all_healthy:
183
+ return (
184
+ '<div style="padding: 10px; background: #e8f5e9; border-radius: 5px;">'
185
+ f'<span style="color: #2e7d32;">NCCL Status: Healthy ({len(stats)} GPUs)</span></div>'
186
+ )
187
+ else:
188
+ return (
189
+ '<div style="padding: 10px; background: #ffebee; border-radius: 5px;">'
190
+ '<span style="color: #c62828;">NCCL Status: Communication issue detected</span></div>'
191
+ )
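The chart builders above recover a GPU id from each history series key via `key.split("=")[-1]`. A minimal sketch of that labeling rule, assuming `MetricHistory` keys look like `"gpu=0"` (the exact key format is not shown in this commit); `gpu_label` is a hypothetical helper, not part of the module:

```python
def gpu_label(series_key: str) -> str:
    # Hypothetical helper mirroring _build_memory_chart_data's key parsing.
    # Assumes "name=value" keys; anything without "=" falls back to GPU 0.
    gpu_id = series_key.split("=")[-1] if "=" in series_key else "0"
    return f"GPU {gpu_id}"

print(gpu_label("gpu=1"))  # GPU 1
```

Note that `split("=")[-1]` keeps only the last segment, so a multi-label key such as `"rank=0=3"` would still yield its final value.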
components/inference_panel.py ADDED
@@ -0,0 +1,209 @@
+ """Inference metrics panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ from typing import Dict, Any, Tuple
6
+
7
+ from collectors.vllm_collector import VLLMCollector, InferenceMetrics
8
+ from utils.history import MetricHistory
9
+
10
+
11
+ def create_inference_panel(history: MetricHistory) -> Dict[str, Any]:
12
+ """
13
+ Create the inference metrics panel.
14
+
15
+ Args:
16
+ history: Metric history for charting
17
+
18
+ Returns:
19
+ Dictionary of Gradio components
20
+ """
21
+ with gr.Column():
22
+ gr.Markdown("### Inference Metrics")
23
+
24
+ # Key metrics row
25
+ with gr.Row():
26
+ throughput = gr.Number(
27
+ label="Tokens/sec",
28
+ precision=1,
29
+ interactive=False,
30
+ )
31
+ ttft = gr.Number(
32
+ label="TTFT (ms)",
33
+ precision=1,
34
+ interactive=False,
35
+ )
36
+ batch_size = gr.Number(
37
+ label="Batch Size",
38
+ precision=0,
39
+ interactive=False,
40
+ )
41
+ kv_cache = gr.Number(
42
+ label="KV Cache %",
43
+ precision=1,
44
+ interactive=False,
45
+ )
46
+
47
+ # Throughput plot
48
+ throughput_plot = gr.LinePlot(
49
+ x="time",
50
+ y="value",
51
+ title="Throughput Over Time",
52
+ x_title="Time",
53
+ y_title="Tokens/sec",
54
+ height=250,
55
+ )
56
+
57
+ # Secondary metrics row
58
+ with gr.Row():
59
+ prefill_pct = gr.Number(
60
+ label="Prefill %",
61
+ precision=1,
62
+ interactive=False,
63
+ )
64
+ decode_pct = gr.Number(
65
+ label="Decode %",
66
+ precision=1,
67
+ interactive=False,
68
+ )
69
+ queue_depth = gr.Number(
70
+ label="Queue Depth",
71
+ precision=0,
72
+ interactive=False,
73
+ )
74
+ e2e_latency = gr.Number(
75
+ label="E2E Latency (ms)",
76
+ precision=1,
77
+ interactive=False,
78
+ )
79
+
80
+ # Latency plot
81
+ latency_plot = gr.LinePlot(
82
+ x="time",
83
+ y="value",
84
+ color="metric",
85
+ title="Latency Over Time",
86
+ x_title="Time",
87
+ y_title="Latency (ms)",
88
+ height=250,
89
+ )
90
+
91
+ return {
92
+ "throughput": throughput,
93
+ "ttft": ttft,
94
+ "batch_size": batch_size,
95
+ "kv_cache": kv_cache,
96
+ "throughput_plot": throughput_plot,
97
+ "prefill_pct": prefill_pct,
98
+ "decode_pct": decode_pct,
99
+ "queue_depth": queue_depth,
100
+ "e2e_latency": e2e_latency,
101
+ "latency_plot": latency_plot,
102
+ }
103
+
104
+
105
+ def update_inference_panel(
106
+ collector: VLLMCollector,
107
+ history: MetricHistory,
108
+ ) -> Tuple[float, float, int, float, pd.DataFrame, float, float, int, float, pd.DataFrame]:
109
+ """
110
+ Update the inference panel with current data.
111
+
112
+ Args:
113
+ collector: vLLM collector instance
114
+ history: Metric history
115
+
116
+ Returns:
117
+ Tuple of all metric values and chart data
118
+ """
119
+ metrics = collector.collect()
120
+
121
+ # Update history
122
+ history.add("tokens_per_second", metrics.tokens_per_second)
123
+ history.add("ttft_ms", metrics.ttft_ms)
124
+ history.add("e2e_latency_ms", metrics.e2e_latency_ms)
125
+ history.add("kv_cache_percent", metrics.kv_cache_usage_percent)
126
+
127
+ # Build throughput chart
128
+ throughput_df = _build_throughput_chart(history)
129
+
130
+ # Build latency chart
131
+ latency_df = _build_latency_chart(history)
132
+
133
+ # Calculate prefill/decode percentages
134
+ prefill_pct = metrics.prefill_ratio * 100
135
+ decode_pct = 100 - prefill_pct
136
+
137
+ return (
138
+ metrics.tokens_per_second,
139
+ metrics.ttft_ms,
140
+ metrics.batch_size,
141
+ metrics.kv_cache_usage_percent,
142
+ throughput_df,
143
+ prefill_pct,
144
+ decode_pct,
145
+ metrics.num_requests_waiting,
146
+ metrics.e2e_latency_ms,
147
+ latency_df,
148
+ )
149
+
150
+
151
+ def _build_throughput_chart(history: MetricHistory) -> pd.DataFrame:
152
+ """Build throughput chart data."""
153
+ points = history.get("tokens_per_second", limit=60)
154
+
155
+ if not points:
156
+ return pd.DataFrame({"time": [], "value": []})
157
+
158
+ return pd.DataFrame([
159
+ {"time": p.timestamp, "value": p.value}
160
+ for p in points
161
+ ])
162
+
163
+
164
+ def _build_latency_chart(history: MetricHistory) -> pd.DataFrame:
165
+ """Build latency chart data with multiple series."""
166
+ ttft_points = history.get("ttft_ms", limit=60)
167
+ e2e_points = history.get("e2e_latency_ms", limit=60)
168
+
169
+ rows = []
170
+
171
+ for p in ttft_points:
172
+ rows.append({
173
+ "time": p.timestamp,
174
+ "value": p.value,
175
+ "metric": "TTFT",
176
+ })
177
+
178
+ for p in e2e_points:
179
+ rows.append({
180
+ "time": p.timestamp,
181
+ "value": p.value,
182
+ "metric": "E2E",
183
+ })
184
+
185
+ if not rows:
186
+ return pd.DataFrame({"time": [], "value": [], "metric": []})
187
+
188
+ return pd.DataFrame(rows)
189
+
190
+
191
+ def get_metrics_dict(metrics: InferenceMetrics) -> Dict[str, float]:
192
+ """
193
+ Convert metrics to dictionary for alerting.
194
+
195
+ Args:
196
+ metrics: InferenceMetrics instance
197
+
198
+ Returns:
199
+ Dictionary of metric name to value
200
+ """
201
+ return {
202
+ "tokens_per_second": metrics.tokens_per_second,
203
+ "ttft_ms": metrics.ttft_ms,
204
+ "e2e_latency_ms": metrics.e2e_latency_ms,
205
+ "kv_cache_percent": metrics.kv_cache_usage_percent,
206
+ "batch_size": metrics.batch_size,
207
+ "queue_depth": metrics.num_requests_waiting,
208
+ "gpu_cache_percent": metrics.gpu_cache_usage_percent,
209
+ }
components/loading_panel.py ADDED
@@ -0,0 +1,151 @@
+ """Model loading progress panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ from typing import Dict, Any, Tuple
6
+
7
+ from collectors.loading_tracker import LoadingTracker, LoadingStatus
8
+
9
+
10
+ def create_loading_panel() -> Dict[str, Any]:
11
+ """
12
+ Create the loading progress panel.
13
+
14
+ Returns:
15
+ Dictionary of Gradio components
16
+ """
17
+ with gr.Column():
18
+ gr.Markdown("### Model Loading Progress")
19
+
20
+ # Status indicator
21
+ loading_status = gr.HTML(
22
+ value=_build_status_html(LoadingStatus.NOT_STARTED),
23
+ )
24
+
25
+ # Progress bar
26
+ loading_progress = gr.Slider(
27
+ label="Loading Progress",
28
+ minimum=0,
29
+ maximum=100,
30
+ value=0,
31
+ interactive=False,
32
+ )
33
+
34
+ with gr.Row():
35
+ shards_loaded = gr.Textbox(
36
+ label="Shards Loaded",
37
+ value="0 / 0",
38
+ interactive=False,
39
+ )
40
+ layers_loaded = gr.Textbox(
41
+ label="Layers Loaded",
42
+ value="0 / 0",
43
+ interactive=False,
44
+ )
45
+ eta = gr.Textbox(
46
+ label="ETA",
47
+ value="-",
48
+ interactive=False,
49
+ )
50
+
51
+ # Shard details table
52
+ gr.Markdown("#### Shard Details")
53
+ shard_table = gr.Dataframe(
54
+ headers=["Shard", "Size (MB)", "Status", "Layers"],
55
+ datatype=["str", "number", "str", "str"],
56
+ label="Shards",
57
+ interactive=False,
58
+ )
59
+
60
+ return {
61
+ "loading_status": loading_status,
62
+ "loading_progress": loading_progress,
63
+ "shards_loaded": shards_loaded,
64
+ "layers_loaded": layers_loaded,
65
+ "eta": eta,
66
+ "shard_table": shard_table,
67
+ }
68
+
69
+
70
+ def update_loading_panel(
71
+ tracker: LoadingTracker,
72
+ ) -> Tuple[str, float, str, str, str, pd.DataFrame]:
73
+ """
74
+ Update the loading panel with current data.
75
+
76
+ Args:
77
+ tracker: Loading tracker instance
78
+
79
+ Returns:
80
+ Tuple of (status_html, progress, shards_text, layers_text, eta_text, shard_table)
81
+ """
82
+ progress = tracker.get_progress()
83
+ shards = tracker.get_shards()
84
+
85
+ # Build status HTML
86
+ status_html = _build_status_html(progress.status)
87
+
88
+ # Build shard table
89
+ shard_rows = []
90
+ for shard in shards[:20]:
91
+ shard_rows.append({
92
+ "Shard": shard.filename,
93
+ "Size (MB)": round(shard.size_mb, 1),
94
+ "Status": _format_shard_status(shard.status),
95
+ "Layers": str(len(shard.layers)),
96
+ })
97
+
98
+ shard_df = pd.DataFrame(shard_rows) if shard_rows else pd.DataFrame(
99
+ columns=["Shard", "Size (MB)", "Status", "Layers"]
100
+ )
101
+
102
+ # Format text values
103
+ shards_text = f"{progress.loaded_shards} / {progress.total_shards}"
104
+ layers_text = f"{progress.layers_loaded} / {progress.total_layers}"
105
+
106
+ if progress.estimated_remaining_seconds:
107
+ minutes = int(progress.estimated_remaining_seconds // 60)
108
+ seconds = int(progress.estimated_remaining_seconds % 60)
109
+ eta_text = f"{minutes}m {seconds}s"
110
+ else:
111
+ eta_text = "-"
112
+
113
+ return (
114
+ status_html,
115
+ progress.progress_percent,
116
+ shards_text,
117
+ layers_text,
118
+ eta_text,
119
+ shard_df,
120
+ )
121
+
122
+
123
+ def _build_status_html(status: LoadingStatus) -> str:
124
+ """Build HTML for loading status."""
125
+ status_configs = {
126
+ LoadingStatus.NOT_STARTED: ("Not Started", "#9e9e9e", "#fafafa"),
127
+ LoadingStatus.DOWNLOADING: ("Downloading", "#1976d2", "#e3f2fd"),
128
+ LoadingStatus.LOADING: ("Loading", "#ff9800", "#fff3e0"),
129
+ LoadingStatus.READY: ("Ready", "#2e7d32", "#e8f5e9"),
130
+ LoadingStatus.ERROR: ("Error", "#c62828", "#ffebee"),
131
+ }
132
+
133
+ text, color, bg_color = status_configs.get(
134
+ status, ("Unknown", "#9e9e9e", "#fafafa")
135
+ )
136
+
137
+ return (
138
+ f'<div style="padding: 10px; background: {bg_color}; '
139
+ f'border-radius: 5px; text-align: center;">'
140
+ f'<span style="color: {color}; font-weight: bold; font-size: 1.2em;">'
141
+ f'{text}</span></div>'
142
+ )
143
+
144
+
145
+ def _format_shard_status(status: str) -> str:
146
+ """Format shard status with indicator."""
147
+ if status == "loaded":
148
+ return "Loaded"
149
+ if status == "loading":
150
+ return "Loading..."
151
+ return "Pending"
components/loadtest_panel.py ADDED
@@ -0,0 +1,220 @@
+ """Load testing panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ import asyncio
6
+ from typing import Dict, Any, Optional
7
+
8
+ from services.load_tester import LoadTester, LoadTestConfig
9
+ from storage.models import LoadTestResult
10
+
11
+
12
+ # Global load tester instance (managed by the panel)
13
+ _active_load_tester: Optional[LoadTester] = None
14
+
15
+
16
+ def create_loadtest_panel() -> Dict[str, Any]:
17
+ """
18
+ Create the load testing panel.
19
+
20
+ Returns:
21
+ Dictionary of Gradio components
22
+ """
23
+ global _active_load_tester
24
+
25
+ with gr.Column():
26
+ gr.Markdown("### Load Testing")
27
+
28
+ # Configuration
29
+ with gr.Row():
30
+ target_endpoint = gr.Textbox(
31
+ label="Target Endpoint",
32
+ value="http://localhost:8000",
33
+ placeholder="http://host:port",
34
+ )
35
+
36
+ with gr.Row():
37
+ concurrent_users = gr.Slider(
38
+ label="Concurrent Users",
39
+ minimum=1,
40
+ maximum=100,
41
+ value=10,
42
+ step=1,
43
+ )
44
+ requests_per_second = gr.Slider(
45
+ label="Requests/Second",
46
+ minimum=0.1,
47
+ maximum=50,
48
+ value=5,
49
+ step=0.5,
50
+ )
51
+ duration = gr.Slider(
52
+ label="Duration (seconds)",
53
+ minimum=10,
54
+ maximum=300,
55
+ value=60,
56
+ step=10,
57
+ )
58
+
59
+ with gr.Row():
60
+ prompt_distribution = gr.Dropdown(
61
+ choices=["fixed", "realistic", "random"],
62
+ value="fixed",
63
+ label="Prompt Distribution",
64
+ )
65
+ max_tokens = gr.Slider(
66
+ label="Max Tokens",
67
+ minimum=10,
68
+ maximum=500,
69
+ value=100,
70
+ step=10,
71
+ )
72
+
73
+ # Control buttons
74
+ with gr.Row():
75
+ start_btn = gr.Button("Start Load Test", variant="primary")
76
+ stop_btn = gr.Button("Stop", variant="stop")
77
+
78
+ # Status
79
+ test_status = gr.Textbox(
80
+ label="Status",
81
+ interactive=False,
82
+ )
83
+
84
+ # Progress
85
+ with gr.Row():
86
+ progress_elapsed = gr.Number(label="Elapsed (s)", precision=0, interactive=False)
87
+ progress_requests = gr.Number(label="Total Requests", precision=0, interactive=False)
88
+ progress_success = gr.Number(label="Successful", precision=0, interactive=False)
89
+ progress_failed = gr.Number(label="Failed", precision=0, interactive=False)
90
+
91
+ # Results
92
+ gr.Markdown("### Results")
93
+ with gr.Row():
94
+ result_avg = gr.Number(label="Avg Latency (ms)", precision=1, interactive=False)
95
+ result_p50 = gr.Number(label="P50 (ms)", precision=1, interactive=False)
96
+ result_p95 = gr.Number(label="P95 (ms)", precision=1, interactive=False)
97
+ result_p99 = gr.Number(label="P99 (ms)", precision=1, interactive=False)
98
+
99
+ with gr.Row():
100
+ result_throughput = gr.Number(label="Throughput (req/s)", precision=2, interactive=False)
101
+ result_saturation = gr.Number(label="Saturation Point", precision=1, interactive=False)
102
+
103
+ # Latency over time chart
104
+ latency_chart = gr.LinePlot(
105
+ x="time",
106
+ y="latency_ms",
107
+ title="Latency Over Time",
108
+ x_title="Time",
109
+ y_title="Latency (ms)",
110
+ height=250,
111
+ )
112
+
113
+ # Event handlers
114
+ async def start_load_test(endpoint, users, rps, dur, dist, max_tok):
115
+ global _active_load_tester
116
+
117
+ config = LoadTestConfig(
118
+ target_endpoint=endpoint,
119
+ concurrent_users=int(users),
120
+ requests_per_second=rps,
121
+ duration_seconds=int(dur),
122
+ prompt_length_distribution=dist,
123
+ max_tokens=int(max_tok),
124
+ )
125
+
126
+ _active_load_tester = LoadTester(config)
127
+
128
+ # Initial status
129
+ yield (
130
+ "Starting load test...",
131
+ 0, 0, 0, 0,
132
+ 0, 0, 0, 0, 0, None,
133
+ pd.DataFrame({"time": [], "latency_ms": []}),
134
+ )
135
+
136
+ try:
137
+ result = await _active_load_tester.run()
138
+ yield format_load_test_results(result, _active_load_tester)
139
+ except Exception as e:
140
+ yield (
141
+ f"Error: {str(e)}",
142
+ 0, 0, 0, 0,
143
+ 0, 0, 0, 0, 0, None,
144
+ pd.DataFrame({"time": [], "latency_ms": []}),
145
+ )
146
+
147
+ def stop_load_test():
148
+ global _active_load_tester
149
+ if _active_load_tester:
150
+ _active_load_tester.stop()
151
+ return "Load test stopped"
152
+ return "No active load test"
153
+
154
+ start_btn.click(
155
+ fn=start_load_test,
156
+ inputs=[
157
+ target_endpoint, concurrent_users, requests_per_second,
158
+ duration, prompt_distribution, max_tokens
159
+ ],
160
+ outputs=[
161
+ test_status,
162
+ progress_elapsed, progress_requests, progress_success, progress_failed,
163
+ result_avg, result_p50, result_p95, result_p99, result_throughput, result_saturation,
164
+ latency_chart,
165
+ ],
166
+ )
167
+
168
+ stop_btn.click(
169
+ fn=stop_load_test,
170
+ outputs=test_status,
171
+ )
172
+
173
+ return {
174
+ "target_endpoint": target_endpoint,
175
+ "concurrent_users": concurrent_users,
176
+ "requests_per_second": requests_per_second,
177
+ "duration": duration,
178
+ "prompt_distribution": prompt_distribution,
179
+ "max_tokens": max_tokens,
180
+ "test_status": test_status,
181
+ "progress_elapsed": progress_elapsed,
182
+ "progress_requests": progress_requests,
183
+ "progress_success": progress_success,
184
+ "progress_failed": progress_failed,
185
+ "result_avg": result_avg,
186
+ "result_p50": result_p50,
187
+ "result_p95": result_p95,
188
+ "result_p99": result_p99,
189
+ "result_throughput": result_throughput,
190
+ "result_saturation": result_saturation,
191
+ "latency_chart": latency_chart,
192
+ }
193
+
194
+
195
+ def format_load_test_results(
196
+ result: LoadTestResult,
197
+ tester: LoadTester,
198
+ ) -> tuple:
199
+ """Format load test results for UI components."""
200
+ # Build latency timeseries
201
+ timeseries = tester.get_latency_timeseries()
202
+ if timeseries:
203
+ latency_df = pd.DataFrame(timeseries)
204
+ else:
205
+ latency_df = pd.DataFrame({"time": [], "latency_ms": []})
206
+
207
+ return (
208
+ f"Load test complete: {result.total_requests} requests",
209
+ result.duration_seconds,
210
+ result.total_requests,
211
+ result.successful_requests,
212
+ result.failed_requests,
213
+ result.avg_latency_ms,
214
+ result.p50_latency_ms,
215
+ result.p95_latency_ms,
216
+ result.p99_latency_ms,
217
+ result.throughput_rps,
218
+ result.saturation_point,
219
+ latency_df,
220
+ )
components/quant_panel.py ADDED
@@ -0,0 +1,118 @@
+ """Quantization details panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ from typing import Dict, Any, Tuple, Optional
6
+
7
+ from collectors.quant_collector import QuantizationCollector, QuantizationInfo
8
+
9
+
10
+ def create_quant_panel() -> Dict[str, Any]:
11
+ """
12
+ Create the quantization details panel.
13
+
14
+ Returns:
15
+ Dictionary of Gradio components
16
+ """
17
+ with gr.Column():
18
+ gr.Markdown("### Quantization Details")
19
+
20
+ with gr.Row():
21
+ quant_type = gr.Textbox(
22
+ label="Quantization Method",
23
+ interactive=False,
24
+ )
25
+ bits = gr.Number(
26
+ label="Bits",
27
+ precision=0,
28
+ interactive=False,
29
+ )
30
+ group_size = gr.Number(
31
+ label="Group Size",
32
+ precision=0,
33
+ interactive=False,
34
+ )
35
+
36
+ # Full configuration JSON
37
+ quant_details = gr.JSON(
38
+ label="Full Configuration",
39
+ )
40
+
41
+ # Layer precision table
42
+ gr.Markdown("#### Per-Layer Precision")
43
+ layer_table = gr.Dataframe(
44
+ headers=["Layer", "Bits", "Group Size", "Dtype"],
45
+ datatype=["str", "number", "str", "str"],
46
+ label="Layer Precisions",
47
+ interactive=False,
48
+ )
49
+
50
+ return {
51
+ "quant_type": quant_type,
52
+ "bits": bits,
53
+ "group_size": group_size,
54
+ "quant_details": quant_details,
55
+ "layer_table": layer_table,
56
+ }
57
+
58
+
59
+ def update_quant_panel(
60
+ collector: QuantizationCollector,
61
+ ) -> Tuple[str, int, Optional[int], Dict, pd.DataFrame]:
62
+ """
63
+ Update the quantization panel with current data.
64
+
65
+ Args:
66
+ collector: Quantization collector instance
67
+
68
+ Returns:
69
+ Tuple of (method, bits, group_size, details_json, layer_table)
70
+ """
71
+ info = collector.detect()
72
+ layers = collector.get_layer_precisions()
73
+
74
+ # Build layer table
75
+ layer_rows = []
76
+ for layer in layers[:20]: # Limit to 20 rows
77
+ layer_rows.append({
78
+ "Layer": layer.layer_name,
79
+ "Bits": layer.bits,
80
+ "Group Size": str(layer.group_size) if layer.group_size else "-",
81
+ "Dtype": layer.dtype,
82
+ })
83
+
84
+ layer_df = pd.DataFrame(layer_rows) if layer_rows else pd.DataFrame(
85
+ columns=["Layer", "Bits", "Group Size", "Dtype"]
86
+ )
87
+
88
+ return (
89
+ info.method,
90
+ info.bits,
91
+ info.group_size,
92
+ info.to_dict(),
93
+ layer_df,
94
+ )
95
+
96
+
97
+ def get_quant_summary(info: QuantizationInfo) -> str:
98
+ """
99
+ Get a summary string for the quantization.
100
+
101
+ Args:
102
+ info: QuantizationInfo instance
103
+
104
+ Returns:
105
+ Human-readable summary string
106
+ """
107
+ if info.method == "None (FP16/BF16)":
108
+ return f"Full precision ({info.compute_dtype or 'float16'})"
109
+
110
+ summary = f"{info.method} {info.bits}-bit"
111
+
112
+ if info.group_size:
113
+ summary += f", group size {info.group_size}"
114
+
115
+ if info.quant_type:
116
+ summary += f" ({info.quant_type})"
117
+
118
+ return summary
components/tracing_panel.py ADDED
@@ -0,0 +1,186 @@
+ """Request tracing panel component."""
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+ from typing import Dict, Any, Tuple
6
+
7
+ from services.request_tracer import RequestTracer
8
+
9
+
10
+ def create_tracing_panel(tracer: RequestTracer) -> Dict[str, Any]:
11
+ """
12
+ Create the request tracing panel.
13
+
14
+ Args:
15
+ tracer: Request tracer instance
16
+
17
+ Returns:
18
+ Dictionary of Gradio components
19
+ """
20
+ with gr.Column():
21
+ gr.Markdown("### Request Tracing")
22
+
23
+ # Filter controls
24
+ with gr.Row():
25
+ trace_filter = gr.Dropdown(
26
+ choices=["All Requests", "Slow Only"],
27
+ value="All Requests",
28
+ label="Filter",
29
+ )
30
+ trace_limit = gr.Slider(
31
+ minimum=10,
32
+ maximum=500,
33
+ value=100,
34
+ step=10,
35
+ label="Show Last N Requests",
36
+ )
37
+ refresh_btn = gr.Button("Refresh", size="sm")
38
+
39
+ # Summary stats
40
+ with gr.Row():
41
+ total_requests = gr.Number(
42
+ label="Total Requests",
43
+ precision=0,
44
+ interactive=False,
45
+ )
46
+ slow_requests = gr.Number(
47
+ label="Slow Requests",
48
+ precision=0,
49
+ interactive=False,
50
+ )
51
+ slow_rate = gr.Number(
52
+ label="Slow Rate %",
53
+ precision=1,
54
+ interactive=False,
55
+ )
56
+ baseline_p95 = gr.Number(
57
+ label="Baseline P95 (ms)",
58
+ precision=1,
59
+ interactive=False,
60
+ )
61
+
62
+ # Traces table
63
+ traces_table = gr.Dataframe(
64
+ headers=[
65
+ "ID", "Prompt Toks", "Output Toks",
66
+ "Queue (ms)", "Prefill (ms)", "Decode (ms)",
67
+ "Total (ms)", "Tok/s", "Slow?"
68
+ ],
69
+ datatype=[
70
+ "str", "number", "number",
71
+ "number", "number", "number",
72
+ "number", "number", "str"
73
+ ],
74
+ label="Request Traces",
75
+ interactive=False,
76
+ )
77
+
78
+ # Latency breakdown chart
79
+ gr.Markdown("#### Average Latency Breakdown")
80
+ latency_breakdown = gr.BarPlot(
81
+ x="phase",
82
+ y="ms",
83
+ title="Latency by Phase",
84
+ x_title="Phase",
85
+ y_title="Time (ms)",
86
+ height=200,
87
+ )
88
+
89
+ # Percentiles
90
+ gr.Markdown("#### Latency Percentiles")
91
+ with gr.Row():
92
+ p50 = gr.Number(label="P50 (ms)", precision=1, interactive=False)
93
+ p95 = gr.Number(label="P95 (ms)", precision=1, interactive=False)
94
+ p99 = gr.Number(label="P99 (ms)", precision=1, interactive=False)
95
+
96
+ # Event handlers
97
+ def refresh_traces(filter_val, limit):
98
+ slow_only = filter_val == "Slow Only"
99
+ return update_tracing_panel(tracer, slow_only, int(limit))
100
+
101
+ refresh_btn.click(
102
+ fn=refresh_traces,
103
+ inputs=[trace_filter, trace_limit],
104
+ outputs=[
105
+ total_requests, slow_requests, slow_rate, baseline_p95,
106
+ traces_table, latency_breakdown, p50, p95, p99
107
+ ],
108
+ )
109
+
110
+ return {
111
+ "trace_filter": trace_filter,
112
+ "trace_limit": trace_limit,
113
+ "total_requests": total_requests,
114
+ "slow_requests": slow_requests,
115
+ "slow_rate": slow_rate,
116
+ "baseline_p95": baseline_p95,
117
+ "traces_table": traces_table,
118
+ "latency_breakdown": latency_breakdown,
119
+ "p50": p50,
120
+ "p95": p95,
121
+ "p99": p99,
122
+ }
123
+
124
+
125
+ def update_tracing_panel(
126
+ tracer: RequestTracer,
127
+ slow_only: bool = False,
128
+ limit: int = 100,
129
+ ) -> Tuple[int, int, float, float, pd.DataFrame, pd.DataFrame, float, float, float]:
130
+ """
131
+ Update the tracing panel with current data.
132
+
133
+ Args:
134
+ tracer: Request tracer instance
135
+ slow_only: Only show slow requests
136
+ limit: Maximum number of traces to show
137
+
138
+ Returns:
139
+ Tuple of all component values
140
+ """
141
+ stats = tracer.get_stats()
142
+ traces = tracer.get_recent_traces(limit=limit, slow_only=slow_only)
143
+ breakdown = tracer.get_latency_breakdown()
144
+ percentiles = tracer.get_percentiles()
145
+
146
+ # Build traces table
147
+ trace_rows = []
148
+ for trace in reversed(traces): # Most recent first
149
+ trace_rows.append({
150
+ "ID": trace.request_id,
151
+ "Prompt Toks": trace.prompt_tokens,
152
+ "Output Toks": trace.output_tokens,
153
+ "Queue (ms)": round(trace.queue_time_ms, 1),
154
+ "Prefill (ms)": round(trace.prefill_time_ms, 1),
155
+ "Decode (ms)": round(trace.decode_time_ms, 1),
156
+ "Total (ms)": round(trace.total_time_ms, 1),
157
+ "Tok/s": round(trace.tokens_per_second, 1),
158
+ "Slow?": "Yes" if trace.is_slow else "",
159
+ })
160
+
161
+ traces_df = pd.DataFrame(trace_rows) if trace_rows else pd.DataFrame(
162
+ columns=[
163
+ "ID", "Prompt Toks", "Output Toks",
164
+ "Queue (ms)", "Prefill (ms)", "Decode (ms)",
165
+ "Total (ms)", "Tok/s", "Slow?"
166
+ ]
167
+ )
168
+
169
+ # Build breakdown chart
170
+ breakdown_df = pd.DataFrame([
171
+ {"phase": "Queue", "ms": breakdown.queue_ms},
172
+ {"phase": "Prefill", "ms": breakdown.prefill_ms},
173
+ {"phase": "Decode", "ms": breakdown.decode_ms},
174
+ ])
175
+
176
+ return (
177
+ stats["total_requests"],
178
+ stats["slow_requests"],
179
+ stats.get("slow_rate_percent", 0),
180
+ stats.get("baseline_p95", 0) or 0,
181
+ traces_df,
182
+ breakdown_df,
183
+ percentiles["p50"],
184
+ percentiles["p95"],
185
+ percentiles["p99"],
186
+ )
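The panel reads p50/p95/p99 from `tracer.get_percentiles()`, whose implementation is not part of this commit. A minimal nearest-rank sketch of one common percentile definition, for orientation only (the real `RequestTracer` may use interpolation or another method):

```python
def percentile(values, pct):
    """Nearest-rank percentile over a list of latencies.

    Returns 0.0 for an empty sample, matching the panel's habit of
    showing 0 when no data is available. Illustrative only; not the
    actual RequestTracer.get_percentiles implementation.
    """
    if not values:
        return 0.0
    ordered = sorted(values)
    # Nearest rank, clamped to a valid index
    k = max(0, min(len(ordered) - 1, int(round(pct / 100 * len(ordered))) - 1))
    return ordered[k]

latencies = list(range(1, 101))  # 1..100 ms
print(percentile(latencies, 95))  # 95
```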
config.py ADDED
@@ -0,0 +1,67 @@
+ """Configuration settings for LLM Inference Dashboard."""
2
+
3
+ from dataclasses import dataclass, field
4
+ from typing import Optional
5
+ import os
6
+
7
+
8
+ @dataclass
9
+ class Config:
10
+ """Dashboard configuration with sensible defaults."""
11
+
12
+ # vLLM Connection
13
+ vllm_host: str = "localhost"
14
+ vllm_port: int = 8000
15
+ model_path: Optional[str] = None
16
+
17
+ # Dashboard
18
+ refresh_interval: float = 1.0
19
+ history_length: int = 300 # 5 minutes at 1s intervals
20
+
21
+ # Database
22
+ db_path: str = "data/metrics.db"
23
+
24
+ # Alert Thresholds
25
+ alert_kv_cache_threshold: float = 90.0
26
+ alert_gpu_memory_threshold: float = 95.0
27
+ alert_ttft_multiplier: float = 2.0
28
+ alert_throughput_drop_pct: float = 50.0
29
+
30
+ # Webhooks
31
+ slack_webhook: Optional[str] = None
32
+ pagerduty_routing_key: Optional[str] = None
33
+ generic_webhooks: list = field(default_factory=list)
34
+
35
+ # Load Testing Defaults
36
+ loadtest_concurrent_users: int = 10
37
+ loadtest_rps: float = 5.0
38
+ loadtest_duration: int = 60
39
+
40
+ @property
41
+ def metrics_endpoint(self) -> str:
42
+ return f"http://{self.vllm_host}:{self.vllm_port}/metrics"
43
+
44
+ @property
45
+ def openai_endpoint(self) -> str:
46
+ return f"http://{self.vllm_host}:{self.vllm_port}/v1"
47
+
48
+ @property
49
+ def health_endpoint(self) -> str:
50
+ return f"http://{self.vllm_host}:{self.vllm_port}/health"
51
+
52
+ @classmethod
53
+ def from_env(cls) -> "Config":
54
+ """Create config from environment variables."""
55
+ return cls(
56
+ vllm_host=os.getenv("VLLM_HOST", "localhost"),
57
+ vllm_port=int(os.getenv("VLLM_PORT", "8000")),
58
+ model_path=os.getenv("MODEL_PATH"),
59
+ refresh_interval=float(os.getenv("REFRESH_INTERVAL", "1.0")),
60
+ db_path=os.getenv("DB_PATH", "data/metrics.db"),
61
+ slack_webhook=os.getenv("SLACK_WEBHOOK"),
62
+ pagerduty_routing_key=os.getenv("PAGERDUTY_KEY"),
63
+ )
64
+
65
+
66
+ # Global config instance
67
+ config = Config.from_env()
requirements.txt ADDED
@@ -0,0 +1,16 @@
+# Core
+gradio>=5.0.0
+requests>=2.28.0
+aiohttp>=3.9.0
+
+# Data processing
+pandas>=2.0.0
+numpy>=1.24.0
+scipy>=1.11.0
+
+# Model utilities
+safetensors>=0.4.0
+huggingface-hub>=0.20.0
+
+# GPU monitoring (optional - will use mock data if unavailable)
+nvidia-ml-py3>=7.352.0
services/__init__.py ADDED
@@ -0,0 +1,18 @@
+ """Services for alerting, tracing, comparison, and load testing."""
2
+
3
+ from .alerting import AlertEngine, AlertDispatcher, Alert
4
+ from .request_tracer import RequestTracer
5
+ from .comparator import ABComparator, DeploymentConfig, ComparisonResult
6
+ from .load_tester import LoadTester, LoadTestConfig
7
+
8
+ __all__ = [
9
+ "AlertEngine",
10
+ "AlertDispatcher",
11
+ "Alert",
12
+ "RequestTracer",
13
+ "ABComparator",
14
+ "DeploymentConfig",
15
+ "ComparisonResult",
16
+ "LoadTester",
17
+ "LoadTestConfig",
18
+ ]
services/alerting.py ADDED
@@ -0,0 +1,421 @@
+ """Alert engine and webhook dispatch for monitoring thresholds."""
2
+
3
+ import asyncio
4
+ import logging
5
+ from dataclasses import dataclass, field
6
+ from datetime import datetime
7
+ from typing import Dict, List, Optional, Any, Callable
8
+ from enum import Enum
9
+
10
+ import aiohttp
11
+
12
+ from storage.database import MetricsDB
13
+ from storage.models import AlertRecord
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+
18
+ class AlertSeverity(Enum):
19
+ INFO = "info"
20
+ WARNING = "warning"
21
+ CRITICAL = "critical"
22
+
23
+
24
+ @dataclass
25
+ class AlertRule:
26
+ """Configuration for an alert rule."""
27
+ name: str
28
+ metric: str
29
+ condition: str # >, <, >=, <=, ==
30
+ threshold: float
31
+ severity: AlertSeverity
32
+ message: str
33
+ # For dynamic thresholds
34
+ threshold_type: str = "static" # static, baseline_multiplier, baseline_percent
35
+ multiplier: float = 1.0
36
+ percent: float = 100.0
37
+ cooldown_seconds: int = 60
38
+
39
+
40
+ @dataclass
41
+ class Alert:
42
+ """A triggered alert instance."""
43
+ rule_name: str
44
+ metric: str
45
+ value: float
46
+ threshold: float
47
+ severity: AlertSeverity
48
+ message: str
49
+ timestamp: datetime = field(default_factory=datetime.now)
50
+ resolved: bool = False
51
+
52
+ def to_dict(self) -> Dict[str, Any]:
53
+ return {
54
+ "rule_name": self.rule_name,
55
+ "metric": self.metric,
56
+ "value": self.value,
57
+ "threshold": self.threshold,
58
+ "severity": self.severity.value,
59
+ "message": self.message,
60
+ "timestamp": self.timestamp.isoformat(),
61
+ "resolved": self.resolved,
62
+ }
63
+
64
+
65
+ # Default alert rules
66
+ DEFAULT_RULES = {
67
+ "kv_cache_high": AlertRule(
68
+ name="kv_cache_high",
69
+ metric="kv_cache_percent",
70
+ condition=">",
71
+ threshold=90.0,
72
+ severity=AlertSeverity.WARNING,
73
+ message="KV cache utilization above 90%",
74
+ ),
75
+ "gpu_memory_critical": AlertRule(
76
+ name="gpu_memory_critical",
77
+ metric="gpu_memory_percent",
78
+ condition=">",
79
+ threshold=95.0,
80
+ severity=AlertSeverity.CRITICAL,
81
+ message="GPU memory critically high (>95%)",
82
+ ),
83
+ "ttft_spike": AlertRule(
84
+ name="ttft_spike",
85
+ metric="ttft_ms",
86
+ condition=">",
87
+ threshold=0, # Dynamic
88
+ threshold_type="baseline_multiplier",
89
+ multiplier=2.0,
90
+ severity=AlertSeverity.WARNING,
91
+ message="Time to first token spiked to 2x baseline",
92
+ ),
93
+ "throughput_drop": AlertRule(
94
+ name="throughput_drop",
95
+ metric="tokens_per_second",
96
+ condition="<",
97
+ threshold=0, # Dynamic
98
+ threshold_type="baseline_percent",
99
+ percent=50.0,
100
+ severity=AlertSeverity.WARNING,
101
+ message="Throughput dropped below 50% of baseline",
102
+ ),
103
+ "queue_buildup": AlertRule(
104
+ name="queue_buildup",
105
+ metric="queue_depth",
106
+ condition=">",
107
+ threshold=50.0,
108
+ severity=AlertSeverity.WARNING,
109
+ message="Request queue depth exceeds 50",
110
+ ),
111
+ }
112
+
113
+
114
+ class AlertEngine:
115
+ """Evaluates metrics against alert rules."""
116
+
117
+ def __init__(self, db: Optional[MetricsDB] = None):
118
+ """
119
+ Initialize alert engine.
120
+
121
+ Args:
122
+ db: Optional database for persisting alerts
123
+ """
124
+ self.db = db
125
+ self.rules: Dict[str, AlertRule] = dict(DEFAULT_RULES)
126
+ self.active_alerts: Dict[str, Alert] = {}
127
+ self.baselines: Dict[str, float] = {}
128
+ self._last_trigger_times: Dict[str, datetime] = {}
129
+ self._callbacks: List[Callable[[Alert], None]] = []
130
+
131
+ def add_rule(self, rule: AlertRule) -> None:
132
+ """Add or update an alert rule."""
133
+ self.rules[rule.name] = rule
134
+
135
+ def remove_rule(self, name: str) -> None:
136
+ """Remove an alert rule."""
137
+ self.rules.pop(name, None)
138
+
139
+ def set_baseline(self, metric: str, value: float) -> None:
140
+ """Set baseline value for a metric."""
141
+ self.baselines[metric] = value
142
+
143
+ def update_baselines(self, metrics: Dict[str, float]) -> None:
144
+ """Update baseline values from current metrics."""
145
+ for metric, value in metrics.items():
146
+ if metric not in self.baselines and value > 0:
147
+ self.baselines[metric] = value
148
+
149
+ def on_alert(self, callback: Callable[[Alert], None]) -> None:
150
+ """Register callback for new alerts."""
151
+ self._callbacks.append(callback)
152
+
153
+ def evaluate(self, metrics: Dict[str, float]) -> List[Alert]:
154
+ """
155
+ Evaluate metrics against all rules.
156
+
157
+ Args:
158
+ metrics: Current metric values
159
+
160
+ Returns:
161
+ List of newly triggered alerts
162
+ """
163
+ new_alerts = []
164
+
165
+ for rule_name, rule in self.rules.items():
166
+ if rule.metric not in metrics:
167
+ continue
168
+
169
+ value = metrics[rule.metric]
170
+ threshold = self._get_threshold(rule)
171
+
172
+ if threshold is None:
173
+ continue
174
+
175
+ triggered = self._check_condition(value, rule.condition, threshold)
176
+
177
+ if triggered:
178
+ # Check cooldown
179
+ if rule_name in self._last_trigger_times:
180
+ elapsed = (
181
+ datetime.now() - self._last_trigger_times[rule_name]
182
+ ).total_seconds()
183
+ if elapsed < rule.cooldown_seconds:
184
+ continue
185
+
186
+ # Create alert
187
+ alert = Alert(
188
+ rule_name=rule_name,
189
+ metric=rule.metric,
190
+ value=value,
191
+ threshold=threshold,
192
+ severity=rule.severity,
193
+ message=rule.message,
194
+ )
195
+
196
+ self.active_alerts[rule_name] = alert
197
+ self._last_trigger_times[rule_name] = datetime.now()
198
+ new_alerts.append(alert)
199
+
200
+ # Persist to database
201
+ if self.db:
202
+ record = AlertRecord(
203
+ rule_name=rule_name,
204
+ severity=rule.severity.value,
205
+ metric_name=rule.metric,
206
+ value=value,
207
+ threshold=threshold,
208
+ message=rule.message,
209
+ )
210
+ self.db.insert_alert(record)
211
+
212
+ # Notify callbacks
213
+ for callback in self._callbacks:
214
+ try:
215
+ callback(alert)
216
+ except Exception as e:
217
+ logger.error(f"Alert callback error: {e}")
218
+
219
+ elif rule_name in self.active_alerts:
220
+ # Resolve alert
221
+ self.active_alerts[rule_name].resolved = True
222
+ del self.active_alerts[rule_name]
223
+
224
+ return new_alerts
225
+
226
+ def _get_threshold(self, rule: AlertRule) -> Optional[float]:
227
+ """Calculate threshold for a rule."""
228
+ if rule.threshold_type == "static":
229
+ return rule.threshold
230
+
231
+ baseline = self.baselines.get(rule.metric)
232
+ if baseline is None:
233
+ return None
234
+
235
+ if rule.threshold_type == "baseline_multiplier":
236
+ return baseline * rule.multiplier
237
+
238
+ if rule.threshold_type == "baseline_percent":
239
+ return baseline * (rule.percent / 100.0)
240
+
241
+ return rule.threshold
242
+
243
+ def _check_condition(
244
+ self, value: float, condition: str, threshold: float
245
+ ) -> bool:
246
+ """Check if condition is met."""
247
+ if condition == ">":
248
+ return value > threshold
249
+ if condition == ">=":
250
+ return value >= threshold
251
+ if condition == "<":
252
+ return value < threshold
253
+ if condition == "<=":
254
+ return value <= threshold
255
+ if condition == "==":
256
+ return abs(value - threshold) < 0.001
257
+ return False
258
+
259
+ def get_active_alerts(self) -> List[Alert]:
260
+ """Get all active (unresolved) alerts."""
261
+ return list(self.active_alerts.values())
262
+
263
+ def clear_alert(self, rule_name: str) -> None:
264
+ """Manually clear an alert."""
265
+ if rule_name in self.active_alerts:
266
+ del self.active_alerts[rule_name]
267
+
268
+
269
+ class AlertDispatcher:
270
+ """Dispatches alerts to external services."""
271
+
272
+ def __init__(
273
+ self,
274
+ slack_webhook: Optional[str] = None,
275
+ pagerduty_key: Optional[str] = None,
276
+ generic_webhooks: Optional[List[str]] = None,
277
+ ):
278
+ """
279
+ Initialize alert dispatcher.
280
+
281
+ Args:
282
+ slack_webhook: Slack incoming webhook URL
283
+ pagerduty_key: PagerDuty routing key
284
+ generic_webhooks: List of generic webhook URLs
285
+ """
286
+ self.slack_webhook = slack_webhook
287
+ self.pagerduty_key = pagerduty_key
288
+ self.generic_webhooks = generic_webhooks or []
289
+
290
+ async def dispatch(self, alert: Alert) -> None:
291
+ """
292
+ Dispatch alert to all configured services.
293
+
294
+ Args:
295
+ alert: Alert to dispatch
296
+ """
297
+ tasks = []
298
+
299
+ if self.slack_webhook:
300
+ tasks.append(self._send_slack(alert))
301
+
302
+ if self.pagerduty_key and alert.severity == AlertSeverity.CRITICAL:
303
+ tasks.append(self._send_pagerduty(alert))
304
+
305
+ for webhook in self.generic_webhooks:
306
+ tasks.append(self._send_generic(webhook, alert))
307
+
308
+ if tasks:
309
+ await asyncio.gather(*tasks, return_exceptions=True)
310
+
311
+ async def _send_slack(self, alert: Alert) -> None:
312
+ """Send alert to Slack."""
313
+ color = "danger" if alert.severity == AlertSeverity.CRITICAL else "warning"
314
+ emoji = "🚨" if alert.severity == AlertSeverity.CRITICAL else "⚠️"
315
+
316
+ payload = {
317
+ "text": f"{emoji} *{alert.severity.value.upper()}*: {alert.message}",
318
+ "attachments": [
319
+ {
320
+ "color": color,
321
+ "fields": [
322
+ {
323
+ "title": "Metric",
324
+ "value": alert.metric,
325
+ "short": True,
326
+ },
327
+ {
328
+ "title": "Value",
329
+ "value": f"{alert.value:.2f}",
330
+ "short": True,
331
+ },
332
+ {
333
+ "title": "Threshold",
334
+ "value": f"{alert.threshold:.2f}",
335
+ "short": True,
336
+ },
337
+ {
338
+ "title": "Time",
339
+ "value": alert.timestamp.strftime("%Y-%m-%d %H:%M:%S"),
340
+ "short": True,
341
+ },
342
+ ],
343
+ }
344
+ ],
345
+ }
346
+
347
+ try:
348
+ async with aiohttp.ClientSession() as session:
349
+ async with session.post(
350
+ self.slack_webhook,
351
+ json=payload,
352
+ timeout=aiohttp.ClientTimeout(total=10),
353
+ ) as response:
354
+ if response.status != 200:
355
+ logger.error(f"Slack webhook failed: {response.status}")
356
+ except Exception as e:
357
+ logger.error(f"Error sending Slack alert: {e}")
358
+
359
+ async def _send_pagerduty(self, alert: Alert) -> None:
360
+ """Send alert to PagerDuty."""
361
+ payload = {
362
+ "routing_key": self.pagerduty_key,
363
+ "event_action": "trigger",
364
+ "dedup_key": f"llm-dashboard-{alert.rule_name}",
365
+ "payload": {
366
+ "summary": alert.message,
367
+ "severity": "critical",
368
+ "source": "llm-inference-dashboard",
369
+ "custom_details": {
370
+ "metric": alert.metric,
371
+ "value": alert.value,
372
+ "threshold": alert.threshold,
373
+ },
374
+ },
375
+ }
376
+
377
+ try:
378
+ async with aiohttp.ClientSession() as session:
379
+ async with session.post(
380
+ "https://events.pagerduty.com/v2/enqueue",
381
+ json=payload,
382
+ timeout=aiohttp.ClientTimeout(total=10),
383
+ ) as response:
384
+ if response.status != 202:
385
+ logger.error(f"PagerDuty failed: {response.status}")
386
+ except Exception as e:
387
+ logger.error(f"Error sending PagerDuty alert: {e}")
388
+
389
+ async def _send_generic(self, webhook_url: str, alert: Alert) -> None:
390
+ """Send alert to a generic webhook."""
391
+ payload = alert.to_dict()
392
+
393
+ try:
394
+ async with aiohttp.ClientSession() as session:
395
+ async with session.post(
396
+ webhook_url,
397
+ json=payload,
398
+ timeout=aiohttp.ClientTimeout(total=10),
399
+ ) as response:
400
+ if response.status >= 400:
401
+ logger.error(f"Webhook {webhook_url} failed: {response.status}")
402
+ except Exception as e:
403
+ logger.error(f"Error sending to webhook {webhook_url}: {e}")
404
+
405
+ async def send_test_alert(self) -> bool:
406
+ """Send a test alert to verify configuration."""
407
+ test_alert = Alert(
408
+ rule_name="test_alert",
409
+ metric="test_metric",
410
+ value=100.0,
411
+ threshold=50.0,
412
+ severity=AlertSeverity.INFO,
413
+ message="This is a test alert from LLM Inference Dashboard",
414
+ )
415
+
416
+ try:
417
+ await self.dispatch(test_alert)
418
+ return True
419
+ except Exception as e:
420
+ logger.error(f"Test alert failed: {e}")
421
+ return False
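The static vs. baseline-derived threshold logic in `AlertEngine._get_threshold` is the core of the dynamic rules (`ttft_spike`, `throughput_drop`). As a standalone sketch, here is a hypothetical mirror of that method using plain dicts instead of `AlertRule` objects — illustration only, not the dashboard's API:

```python
def resolve_threshold(rule: dict, baselines: dict):
    # Mirrors AlertEngine._get_threshold: static rules use their fixed
    # threshold; dynamic rules derive one from a recorded baseline.
    if rule["threshold_type"] == "static":
        return rule["threshold"]
    baseline = baselines.get(rule["metric"])
    if baseline is None:
        return None  # no baseline observed yet -> rule cannot fire
    if rule["threshold_type"] == "baseline_multiplier":
        return baseline * rule["multiplier"]
    if rule["threshold_type"] == "baseline_percent":
        return baseline * rule["percent"] / 100.0
    return rule["threshold"]


# Like the ttft_spike default rule: fire when TTFT exceeds 2x baseline.
ttft_rule = {
    "metric": "ttft_ms",
    "threshold_type": "baseline_multiplier",
    "multiplier": 2.0,
    "threshold": 0,
}
print(resolve_threshold(ttft_rule, {"ttft_ms": 120.0}))  # 240.0
```

Returning `None` when no baseline exists is what lets `evaluate` silently skip dynamic rules on a cold start instead of firing spurious alerts.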
services/comparator.py ADDED
@@ -0,0 +1,366 @@
+ """A/B comparison of vLLM deployments."""
+
+ import asyncio
+ import logging
+ from dataclasses import dataclass, field
+ from typing import Optional, Dict, List, Any
+ from datetime import datetime
+
+ import aiohttp
+ from scipy import stats
+
+ from utils.prometheus_parser import (
+     parse_prometheus_metrics,
+     get_metric_value,
+     get_histogram_quantile,
+ )
+
+ logger = logging.getLogger(__name__)
+
+
+ @dataclass
+ class DeploymentConfig:
+     """Configuration for a vLLM deployment."""
+     name: str
+     endpoint: str  # Base URL (e.g., http://localhost:8000)
+     model_name: str = ""
+     quantization: str = ""
+
+     @property
+     def metrics_url(self) -> str:
+         return f"{self.endpoint}/metrics"
+
+
+ @dataclass
+ class DeploymentMetrics:
+     """Metrics collected from a deployment."""
+     endpoint: str
+     timestamp: datetime = field(default_factory=datetime.now)
+     connected: bool = False
+
+     # Throughput
+     tokens_per_second: float = 0.0
+     throughput_samples: List[float] = field(default_factory=list)
+
+     # Latency
+     ttft_ms: float = 0.0
+     tpot_ms: float = 0.0
+     e2e_latency_ms: float = 0.0
+     latency_samples: List[float] = field(default_factory=list)
+
+     # Resources
+     gpu_memory_gb: float = 0.0
+     kv_cache_percent: float = 0.0
+     batch_size: int = 0
+
+     # Model info
+     model_name: str = ""
+
+
+ @dataclass
+ class ComparisonResult:
+     """Result of comparing two deployments."""
+     deployment_a: DeploymentMetrics
+     deployment_b: DeploymentMetrics
+     timestamp: datetime = field(default_factory=datetime.now)
+
+     # Differences
+     throughput_diff_pct: float = 0.0
+     ttft_diff_pct: float = 0.0
+     latency_diff_pct: float = 0.0
+     memory_diff_gb: float = 0.0
+
+     # Statistical significance
+     throughput_significant: bool = False
+     latency_significant: bool = False
+     p_value_throughput: float = 1.0
+     p_value_latency: float = 1.0
+
+     # Recommendation
+     recommendation: str = ""
+
+
+ class ABComparator:
+     """Compares metrics between two vLLM deployments."""
+
+     def __init__(
+         self,
+         deployment_a: DeploymentConfig,
+         deployment_b: DeploymentConfig,
+         sample_count: int = 30,
+     ):
+         """
+         Initialize comparator.
+
+         Args:
+             deployment_a: First deployment configuration
+             deployment_b: Second deployment configuration
+             sample_count: Number of samples to collect for statistical tests
+         """
+         self.deployment_a = deployment_a
+         self.deployment_b = deployment_b
+         self.sample_count = sample_count
+         self._samples_a: List[DeploymentMetrics] = []
+         self._samples_b: List[DeploymentMetrics] = []
+
+     async def collect_metrics(self, config: DeploymentConfig) -> DeploymentMetrics:
+         """
+         Collect current metrics from a deployment.
+
+         Args:
+             config: Deployment configuration
+
+         Returns:
+             DeploymentMetrics with current values
+         """
+         metrics = DeploymentMetrics(endpoint=config.endpoint)
+
+         try:
+             async with aiohttp.ClientSession() as session:
+                 async with session.get(
+                     config.metrics_url,
+                     timeout=aiohttp.ClientTimeout(total=5),
+                 ) as response:
+                     if response.status != 200:
+                         return metrics
+
+                     text = await response.text()
+                     raw = parse_prometheus_metrics(text)
+                     metrics.connected = True
+
+                     # Parse metrics
+                     metrics.tokens_per_second = self._calculate_tps(raw)
+                     metrics.ttft_ms = (
+                         get_histogram_quantile(
+                             raw, "vllm:time_to_first_token_seconds", 0.5
+                         )
+                         or 0
+                     ) * 1000
+                     metrics.tpot_ms = (
+                         get_histogram_quantile(
+                             raw, "vllm:time_per_output_token_seconds", 0.5
+                         )
+                         or 0
+                     ) * 1000
+                     metrics.e2e_latency_ms = (
+                         get_histogram_quantile(
+                             raw, "vllm:e2e_request_latency_seconds", 0.5
+                         )
+                         or 0
+                     ) * 1000
+                     metrics.kv_cache_percent = (
+                         get_metric_value(raw, "vllm:gpu_cache_usage_perc") or 0
+                     ) * 100
+                     metrics.batch_size = int(
+                         get_metric_value(raw, "vllm:num_requests_running") or 0
+                     )
+
+                     # Model name from labels
+                     for samples in raw.values():
+                         for sample in samples:
+                             if "model_name" in sample.labels:
+                                 metrics.model_name = sample.labels["model_name"]
+                                 break
+
+         except Exception as e:
+             logger.error(f"Error collecting metrics from {config.endpoint}: {e}")
+
+         return metrics
+
+     def _calculate_tps(self, raw: Dict) -> float:
+         """Calculate tokens per second from counter metrics."""
+         # This is a simplified calculation
+         # In practice, you'd track delta over time
+         generation_total = get_metric_value(raw, "vllm:generation_tokens_total") or 0
+         if generation_total > 0:
+             # Estimate based on running requests
+             running = get_metric_value(raw, "vllm:num_requests_running") or 1
+             tpot = (
+                 get_histogram_quantile(
+                     raw, "vllm:time_per_output_token_seconds", 0.5
+                 )
+                 or 0.05
+             )
+             if tpot > 0:
+                 return running / tpot
+         return 0
+
+     async def collect_samples(self, count: Optional[int] = None) -> None:
+         """
+         Collect multiple samples for statistical comparison.
+
+         Args:
+             count: Number of samples to collect
+         """
+         if count is None:
+             count = self.sample_count
+
+         self._samples_a.clear()
+         self._samples_b.clear()
+
+         for i in range(count):
+             metrics_a, metrics_b = await asyncio.gather(
+                 self.collect_metrics(self.deployment_a),
+                 self.collect_metrics(self.deployment_b),
+             )
+
+             if metrics_a.connected:
+                 metrics_a.throughput_samples = [metrics_a.tokens_per_second]
+                 metrics_a.latency_samples = [metrics_a.e2e_latency_ms]
+                 self._samples_a.append(metrics_a)
+
+             if metrics_b.connected:
+                 metrics_b.throughput_samples = [metrics_b.tokens_per_second]
+                 metrics_b.latency_samples = [metrics_b.e2e_latency_ms]
+                 self._samples_b.append(metrics_b)
+
+             # Wait between samples
+             if i < count - 1:
+                 await asyncio.sleep(1)
+
+     async def compare(self) -> ComparisonResult:
+         """
+         Perform comparison between deployments.
+
+         Returns:
+             ComparisonResult with comparison data
+         """
+         # Collect current metrics
+         metrics_a, metrics_b = await asyncio.gather(
+             self.collect_metrics(self.deployment_a),
+             self.collect_metrics(self.deployment_b),
+         )
+
+         result = ComparisonResult(
+             deployment_a=metrics_a,
+             deployment_b=metrics_b,
+         )
+
+         # Calculate differences
+         if metrics_a.tokens_per_second > 0:
+             result.throughput_diff_pct = (
+                 (metrics_b.tokens_per_second - metrics_a.tokens_per_second)
+                 / metrics_a.tokens_per_second
+             ) * 100
+
+         if metrics_a.ttft_ms > 0:
+             result.ttft_diff_pct = (
+                 (metrics_b.ttft_ms - metrics_a.ttft_ms) / metrics_a.ttft_ms
+             ) * 100
+
+         if metrics_a.e2e_latency_ms > 0:
+             result.latency_diff_pct = (
+                 (metrics_b.e2e_latency_ms - metrics_a.e2e_latency_ms)
+                 / metrics_a.e2e_latency_ms
+             ) * 100
+
+         result.memory_diff_gb = metrics_b.gpu_memory_gb - metrics_a.gpu_memory_gb
+
+         # Statistical significance (if we have samples)
+         if self._samples_a and self._samples_b:
+             result = self._add_significance(result)
+
+         # Generate recommendation
+         result.recommendation = self._generate_recommendation(result)
+
+         return result
+
+     def _add_significance(self, result: ComparisonResult) -> ComparisonResult:
+         """Add statistical significance tests to result."""
+         tps_a = [s.tokens_per_second for s in self._samples_a]
+         tps_b = [s.tokens_per_second for s in self._samples_b]
+
+         lat_a = [s.e2e_latency_ms for s in self._samples_a]
+         lat_b = [s.e2e_latency_ms for s in self._samples_b]
+
+         if len(tps_a) >= 2 and len(tps_b) >= 2:
+             try:
+                 _, p_tps = stats.ttest_ind(tps_a, tps_b)
+                 result.p_value_throughput = p_tps
+                 result.throughput_significant = p_tps < 0.05
+             except Exception:
+                 pass
+
+         if len(lat_a) >= 2 and len(lat_b) >= 2:
+             try:
+                 _, p_lat = stats.ttest_ind(lat_a, lat_b)
+                 result.p_value_latency = p_lat
+                 result.latency_significant = p_lat < 0.05
+             except Exception:
+                 pass
+
+         return result
+
+     def _generate_recommendation(self, result: ComparisonResult) -> str:
+         """Generate a human-readable recommendation."""
+         parts = []
+         a_name = self.deployment_a.name
+         b_name = self.deployment_b.name
+
+         # Throughput comparison
+         if abs(result.throughput_diff_pct) > 5:
+             faster = b_name if result.throughput_diff_pct > 0 else a_name
+             diff = abs(result.throughput_diff_pct)
+             sig = " (statistically significant)" if result.throughput_significant else ""
+             parts.append(f"{faster} has {diff:.1f}% higher throughput{sig}")
+
+         # Latency comparison
+         if abs(result.latency_diff_pct) > 5:
+             faster = a_name if result.latency_diff_pct > 0 else b_name
+             diff = abs(result.latency_diff_pct)
+             sig = " (statistically significant)" if result.latency_significant else ""
+             parts.append(f"{faster} has {diff:.1f}% lower latency{sig}")
+
+         # Memory comparison
+         if abs(result.memory_diff_gb) > 1:
+             lower = a_name if result.memory_diff_gb > 0 else b_name
+             diff = abs(result.memory_diff_gb)
+             parts.append(f"{lower} uses {diff:.1f}GB less GPU memory")
+
+         if not parts:
+             return "Both deployments show similar performance"
+
+         return ". ".join(parts) + "."
+
+     def get_comparison_table(self, result: ComparisonResult) -> List[Dict[str, Any]]:
+         """
+         Generate comparison table data.
+
+         Args:
+             result: Comparison result
+
+         Returns:
+             List of rows for comparison table
+         """
+         return [
+             {
+                 "Metric": "Throughput (tok/s)",
+                 self.deployment_a.name: f"{result.deployment_a.tokens_per_second:.1f}",
+                 self.deployment_b.name: f"{result.deployment_b.tokens_per_second:.1f}",
+                 "Diff": f"{result.throughput_diff_pct:+.1f}%",
+             },
+             {
+                 "Metric": "TTFT (ms)",
+                 self.deployment_a.name: f"{result.deployment_a.ttft_ms:.1f}",
+                 self.deployment_b.name: f"{result.deployment_b.ttft_ms:.1f}",
+                 "Diff": f"{result.ttft_diff_pct:+.1f}%",
+             },
+             {
+                 "Metric": "E2E Latency (ms)",
+                 self.deployment_a.name: f"{result.deployment_a.e2e_latency_ms:.1f}",
+                 self.deployment_b.name: f"{result.deployment_b.e2e_latency_ms:.1f}",
+                 "Diff": f"{result.latency_diff_pct:+.1f}%",
+             },
+             {
+                 "Metric": "KV Cache %",
+                 self.deployment_a.name: f"{result.deployment_a.kv_cache_percent:.1f}",
+                 self.deployment_b.name: f"{result.deployment_b.kv_cache_percent:.1f}",
+                 "Diff": "-",
+             },
+             {
+                 "Metric": "Batch Size",
+                 self.deployment_a.name: str(result.deployment_a.batch_size),
+                 self.deployment_b.name: str(result.deployment_b.batch_size),
+                 "Diff": "-",
+             },
+         ]
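The sign conventions in `_generate_recommendation` are easy to get backwards (a positive throughput diff favors B, a positive latency diff favors A). This standalone sketch mirrors just those first two checks with plain arguments; the function name and the deployment labels `"fp16"`/`"awq"` are made up for illustration:

```python
def recommend(name_a: str, name_b: str,
              tput_diff_pct: float, lat_diff_pct: float) -> str:
    # Diffs are computed as (B - A) / A * 100, so positive throughput
    # diff means B is faster, while positive latency diff means B is
    # slower (A wins). Changes under 5% are treated as noise.
    parts = []
    if abs(tput_diff_pct) > 5:
        faster = name_b if tput_diff_pct > 0 else name_a
        parts.append(f"{faster} has {abs(tput_diff_pct):.1f}% higher throughput")
    if abs(lat_diff_pct) > 5:
        faster = name_a if lat_diff_pct > 0 else name_b
        parts.append(f"{faster} has {abs(lat_diff_pct):.1f}% lower latency")
    if not parts:
        return "Both deployments show similar performance"
    return ". ".join(parts) + "."


# B is 12% faster and 8% lower latency, so both clauses favor "awq".
print(recommend("fp16", "awq", 12.0, -8.0))
# awq has 12.0% higher throughput. awq has 8.0% lower latency.
```

The 5% floor keeps the recommendation quiet for differences that a t-test on ~30 one-second samples would rarely be able to distinguish from noise anyway.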
services/load_tester.py ADDED
@@ -0,0 +1,359 @@
+ """Load testing engine for vLLM endpoints."""
+
+ import asyncio
+ import logging
+ import random
+ import statistics
+ import time
+ import uuid
+ from dataclasses import dataclass, field
+ from datetime import datetime
+ from typing import List, Optional, Dict, Any, Callable
+ from collections import deque
+
+ import aiohttp
+ import numpy as np
+
+ from storage.database import MetricsDB
+ from storage.models import LoadTestResult
+
+ logger = logging.getLogger(__name__)
+
+
+ @dataclass
+ class LoadTestConfig:
+     """Configuration for a load test."""
+     target_endpoint: str
+     concurrent_users: int = 10
+     requests_per_second: float = 5.0
+     duration_seconds: int = 60
+     prompt: str = "Hello, please write a short story about a robot."
+     max_tokens: int = 100
+     prompt_length_distribution: str = "fixed"  # fixed, realistic, random
+
+
+ @dataclass
+ class RequestResult:
+     """Result of a single request."""
+     success: bool
+     latency_ms: float
+     tokens: int
+     error: Optional[str] = None
+     timestamp: datetime = field(default_factory=datetime.now)
+
+
+ class LoadTester:
+     """Load testing engine for vLLM inference endpoints."""
+
+     def __init__(self, config: LoadTestConfig, db: Optional[MetricsDB] = None):
+         """
+         Initialize load tester.
+
+         Args:
+             config: Load test configuration
+             db: Optional database for storing results
+         """
+         self.config = config
+         self.db = db
+         self.running = False
+         self._results: List[RequestResult] = []
+         self._latency_over_time: deque = deque(maxlen=10000)
+         self._progress_callback: Optional[Callable[[Dict], None]] = None
+         self._start_time: Optional[float] = None
+
+     def set_config(self, config: LoadTestConfig) -> None:
+         """Update configuration."""
+         self.config = config
+
+     def on_progress(self, callback: Callable[[Dict], None]) -> None:
+         """Register progress callback."""
+         self._progress_callback = callback
+
+     async def run(self) -> LoadTestResult:
+         """
+         Run the load test.
+
+         Returns:
+             LoadTestResult with test results
+         """
+         self.running = True
+         self._results = []
+         self._latency_over_time.clear()
+         self._start_time = time.time()
+
+         test_id = str(uuid.uuid4())[:8]
+
+         logger.info(
+             f"Starting load test {test_id}: "
+             f"{self.config.concurrent_users} users, "
+             f"{self.config.requests_per_second} RPS, "
+             f"{self.config.duration_seconds}s"
+         )
+
+         # Create semaphore for concurrency control
+         semaphore = asyncio.Semaphore(self.config.concurrent_users)
+
+         # Calculate request interval
+         interval = 1.0 / self.config.requests_per_second
+
+         # Generate load
+         tasks = []
+         end_time = time.time() + self.config.duration_seconds
+
+         try:
+             while time.time() < end_time and self.running:
+                 # The semaphore is acquired inside the worker task so that
+                 # in-flight requests are actually capped; acquiring it here
+                 # would release it before the request even started.
+                 task = asyncio.create_task(self._make_request(semaphore))
+                 tasks.append(task)
+
+                 # Report progress
+                 if self._progress_callback:
+                     self._progress_callback(self._get_progress())
+
+                 await asyncio.sleep(interval)
+
+             # Wait for remaining tasks
+             if tasks:
+                 await asyncio.gather(*tasks, return_exceptions=True)
+
+         except asyncio.CancelledError:
+             logger.info("Load test cancelled")
+         except Exception as e:
+             logger.error(f"Load test error: {e}")
+         finally:
+             self.running = False
+
+         # Analyze results
+         result = self._analyze_results(test_id)
+
+         # Persist to database
+         if self.db:
+             try:
+                 self.db.insert_load_test(result)
+             except Exception as e:
+                 logger.error(f"Error persisting load test: {e}")
+
+         return result
+
+     def stop(self) -> None:
+         """Stop the running load test."""
+         self.running = False
+
+     async def _make_request(self, semaphore: asyncio.Semaphore) -> None:
+         """Make a single request, bounded by the concurrency semaphore."""
+         async with semaphore:
+             prompt = self._generate_prompt()
+             start = time.perf_counter()
+             tokens = 0
+             error = None
+             success = False
+
+             try:
+                 async with aiohttp.ClientSession() as session:
+                     payload = {
+                         "model": "default",
+                         "messages": [{"role": "user", "content": prompt}],
+                         "max_tokens": self.config.max_tokens,
+                         "stream": False,
+                     }
+
+                     async with session.post(
+                         f"{self.config.target_endpoint}/v1/chat/completions",
+                         json=payload,
+                         timeout=aiohttp.ClientTimeout(total=60),
+                     ) as response:
+                         if response.status == 200:
+                             data = await response.json()
+                             tokens = data.get("usage", {}).get("completion_tokens", 0)
+                             success = True
+                         else:
+                             error = f"HTTP {response.status}"
+
+             except asyncio.TimeoutError:
+                 error = "Timeout"
+             except Exception as e:
+                 error = str(e)
+
+             latency = (time.perf_counter() - start) * 1000
+
+             result = RequestResult(
+                 success=success,
+                 latency_ms=latency,
+                 tokens=tokens,
+                 error=error,
+             )
+
+             self._results.append(result)
+             self._latency_over_time.append({
+                 "time": datetime.now(),
+                 "latency_ms": latency,
+                 "success": success,
+             })
+
+     def _generate_prompt(self) -> str:
+         """Generate a prompt based on configuration."""
+         if self.config.prompt_length_distribution == "fixed":
+             return self.config.prompt
+
+         if self.config.prompt_length_distribution == "realistic":
+             # Simulate realistic prompt length distribution
+             prompts = [
+                 "Hello!",
+                 "Write a haiku about programming.",
+                 "Explain quantum computing in simple terms.",
+                 "Write a detailed technical analysis of transformer architectures and their impact on modern NLP systems.",
+                 "Compare and contrast the approaches of different programming paradigms including object-oriented, functional, and procedural programming. Provide examples in Python for each.",
+             ]
+             return random.choice(prompts)
+
+         if self.config.prompt_length_distribution == "random":
+             words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
+             length = random.randint(5, 100)
+             return " ".join(random.choices(words, k=length))
+
+         return self.config.prompt
+
+     def _get_progress(self) -> Dict[str, Any]:
+         """Get current progress."""
+         if not self._start_time:
+             return {}
+
+         elapsed = time.time() - self._start_time
+         successful = sum(1 for r in self._results if r.success)
+         latencies = [r.latency_ms for r in self._results if r.success]
+
+         return {
+             "elapsed_seconds": elapsed,
+             "total_requests": len(self._results),
+             "successful_requests": successful,
+             "failed_requests": len(self._results) - successful,
+             "avg_latency_ms": statistics.mean(latencies) if latencies else 0,
+             "current_rps": len(self._results) / elapsed if elapsed > 0 else 0,
+         }
+
+     def _analyze_results(self, test_id: str) -> LoadTestResult:
+         """Analyze test results."""
+         successful = [r for r in self._results if r.success]
+         failed = [r for r in self._results if not r.success]
+
+         latencies = [r.latency_ms for r in successful]
+
+         if not latencies:
+             return LoadTestResult(
+                 test_id=test_id,
+                 target_endpoint=self.config.target_endpoint,
+                 concurrent_users=self.config.concurrent_users,
+                 requests_per_second=self.config.requests_per_second,
+                 duration_seconds=self.config.duration_seconds,
+                 total_requests=len(self._results),
+                 successful_requests=0,
+                 failed_requests=len(failed),
+                 avg_latency_ms=0,
+                 p50_latency_ms=0,
+                 p95_latency_ms=0,
+                 p99_latency_ms=0,
+                 throughput_rps=0,
+             )
+
+         # Calculate percentiles
+         sorted_latencies = sorted(latencies)
+         n = len(sorted_latencies)
+
+         p50 = sorted_latencies[int(n * 0.50)]
+         p95 = sorted_latencies[int(n * 0.95)]
+         p99 = sorted_latencies[min(int(n * 0.99), n - 1)]
+
+         # Calculate throughput
+         duration = self.config.duration_seconds
+         if self._start_time:
+             duration = time.time() - self._start_time
+
+         throughput = len(successful) / duration if duration > 0 else 0
+
+         # Detect saturation point
+         saturation = self._find_saturation_point()
+
+         return LoadTestResult(
277
+ test_id=test_id,
278
+ target_endpoint=self.config.target_endpoint,
279
+ concurrent_users=self.config.concurrent_users,
280
+ requests_per_second=self.config.requests_per_second,
281
+ duration_seconds=self.config.duration_seconds,
282
+ total_requests=len(self._results),
283
+ successful_requests=len(successful),
284
+ failed_requests=len(failed),
285
+ avg_latency_ms=statistics.mean(latencies),
286
+ p50_latency_ms=p50,
287
+ p95_latency_ms=p95,
288
+ p99_latency_ms=p99,
289
+ throughput_rps=throughput,
290
+ saturation_point=saturation,
291
+ )
292
+
293
+ def _find_saturation_point(self) -> Optional[float]:
294
+ """
295
+ Find the point where latency starts increasing dramatically.
296
+
297
+ Returns:
298
+ Request rate at saturation point, or None if not found
299
+ """
300
+ if len(self._latency_over_time) < 20:
301
+ return None
302
+
303
+ # Group latencies by time buckets
304
+ latencies = list(self._latency_over_time)
305
+ bucket_size = len(latencies) // 10
306
+
307
+ bucket_avgs = []
308
+ for i in range(0, len(latencies), bucket_size):
309
+ bucket = latencies[i : i + bucket_size]
310
+ if bucket:
311
+ avg = statistics.mean(r["latency_ms"] for r in bucket)
312
+ bucket_avgs.append(avg)
313
+
314
+ if len(bucket_avgs) < 3:
315
+ return None
316
+
317
+ # Look for significant increase (2x)
318
+ baseline = bucket_avgs[0]
319
+ for i, avg in enumerate(bucket_avgs):
320
+ if avg > baseline * 2:
321
+ # Estimate RPS at this point
322
+ elapsed = self.config.duration_seconds * (i / len(bucket_avgs))
323
+ return len(self._results) / elapsed if elapsed > 0 else None
324
+
325
+ return None
326
+
327
+ def get_latency_timeseries(self) -> List[Dict[str, Any]]:
328
+ """
329
+ Get latency over time for charting.
330
+
331
+ Returns:
332
+ List of {time, latency_ms} points
333
+ """
334
+ return [
335
+ {"time": p["time"], "latency_ms": p["latency_ms"]}
336
+ for p in self._latency_over_time
337
+ ]
338
+
339
+ def get_latency_histogram(self, bins: int = 20) -> Dict[str, Any]:
340
+ """
341
+ Get latency histogram data.
342
+
343
+ Args:
344
+ bins: Number of histogram bins
345
+
346
+ Returns:
347
+ Dictionary with bin edges and counts
348
+ """
349
+ latencies = [r.latency_ms for r in self._results if r.success]
350
+
351
+ if not latencies:
352
+ return {"bins": [], "counts": []}
353
+
354
+ counts, edges = np.histogram(latencies, bins=bins)
355
+
356
+ return {
357
+ "bins": [(edges[i] + edges[i + 1]) / 2 for i in range(len(counts))],
358
+ "counts": counts.tolist(),
359
+ }
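The percentile computation in `_analyze_results` (sort, index at `int(n * q)`, clamp P99 to the last element) can be sketched standalone; the latency values below are synthetic, not real request results:

```python
import statistics

# Hypothetical successful-request latencies in milliseconds
latencies = [120.0, 95.0, 110.0, 300.0, 105.0, 98.0, 102.0, 115.0, 130.0, 99.0]

sorted_latencies = sorted(latencies)
n = len(sorted_latencies)

# Same nearest-rank style indexing as the load tester
p50 = sorted_latencies[int(n * 0.50)]
p95 = sorted_latencies[int(n * 0.95)]
p99 = sorted_latencies[min(int(n * 0.99), n - 1)]  # clamp so the index never overruns

print(p50, p95, p99)                         # 110.0 300.0 300.0
print(round(statistics.mean(latencies), 1))  # 127.4
```

Note that with ten samples the single 300 ms outlier dominates both P95 and P99 while barely moving the mean, which is why the dashboard reports both.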
services/request_tracer.py ADDED
@@ -0,0 +1,272 @@
"""Request tracing and latency analysis."""

import logging
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Any
from collections import deque
import statistics

from storage.database import MetricsDB
from storage.models import RequestTrace

logger = logging.getLogger(__name__)


@dataclass
class LatencyBreakdown:
    """Breakdown of request latency by phase."""
    queue_ms: float
    prefill_ms: float
    decode_ms: float
    total_ms: float

    @property
    def as_dict(self) -> Dict[str, float]:
        return {
            "queue": self.queue_ms,
            "prefill": self.prefill_ms,
            "decode": self.decode_ms,
            "total": self.total_ms,
        }


@dataclass
class TraceCorrelation:
    """Correlation analysis for a trace."""
    memory_pressure: bool
    likely_cause: str
    memory_delta_gb: float


class RequestTracer:
    """Tracks and analyzes request latency."""

    def __init__(self, db: Optional[MetricsDB] = None, p95_window: int = 100):
        """
        Initialize request tracer.

        Args:
            db: Optional database for persisting traces
            p95_window: Number of recent requests for P95 calculation
        """
        self.db = db
        self._traces: deque = deque(maxlen=1000)
        self._latency_window: deque = deque(maxlen=p95_window)
        self._baseline_p95: Optional[float] = None
        self._slow_threshold_ms: Optional[float] = None

    def record_trace(
        self,
        request_id: Optional[str] = None,
        prompt_tokens: int = 0,
        output_tokens: int = 0,
        queue_time_ms: float = 0,
        prefill_time_ms: float = 0,
        decode_time_ms: float = 0,
        total_time_ms: Optional[float] = None,
        gpu_memory_start: float = 0,
        gpu_memory_end: float = 0,
    ) -> RequestTrace:
        """
        Record a request trace.

        Args:
            request_id: Unique request identifier
            prompt_tokens: Number of prompt tokens
            output_tokens: Number of output tokens
            queue_time_ms: Time spent in queue
            prefill_time_ms: Time for prefill/prompt processing
            decode_time_ms: Time for token generation
            total_time_ms: Total end-to-end time
            gpu_memory_start: GPU memory at request start
            gpu_memory_end: GPU memory at request end

        Returns:
            Created RequestTrace
        """
        if request_id is None:
            request_id = str(uuid.uuid4())[:8]

        if total_time_ms is None:
            total_time_ms = queue_time_ms + prefill_time_ms + decode_time_ms

        # Calculate tokens per second from the decode phase only
        tokens_per_sec = 0
        if decode_time_ms > 0:
            tokens_per_sec = (output_tokens / decode_time_ms) * 1000

        # Determine if slow relative to the rolling P95 threshold
        is_slow = False
        if self._slow_threshold_ms and total_time_ms > self._slow_threshold_ms:
            is_slow = True

        trace = RequestTrace(
            request_id=request_id,
            prompt_tokens=prompt_tokens,
            output_tokens=output_tokens,
            queue_time_ms=queue_time_ms,
            prefill_time_ms=prefill_time_ms,
            decode_time_ms=decode_time_ms,
            total_time_ms=total_time_ms,
            tokens_per_second=tokens_per_sec,
            gpu_memory_at_start=gpu_memory_start,
            gpu_memory_at_end=gpu_memory_end,
            is_slow=is_slow,
        )

        # Store in memory
        self._traces.append(trace)
        self._latency_window.append(total_time_ms)

        # Update P95 baseline
        self._update_baseline()

        # Persist to database
        if self.db:
            try:
                self.db.insert_trace(trace)
            except Exception as e:
                logger.error(f"Error persisting trace: {e}")

        # Log slow requests
        if is_slow:
            logger.warning(
                f"Slow request {request_id}: {total_time_ms:.1f}ms "
                f"(threshold: {self._slow_threshold_ms:.1f}ms)"
            )

        return trace

    def _update_baseline(self) -> None:
        """Update P95 baseline from recent requests."""
        if len(self._latency_window) >= 10:
            sorted_latencies = sorted(self._latency_window)
            p95_idx = int(len(sorted_latencies) * 0.95)
            self._baseline_p95 = sorted_latencies[p95_idx]
            # Set slow threshold at 1.5x P95
            self._slow_threshold_ms = self._baseline_p95 * 1.5

    def get_recent_traces(
        self, limit: int = 100, slow_only: bool = False
    ) -> List[RequestTrace]:
        """
        Get recent traces.

        Args:
            limit: Maximum number of traces
            slow_only: Only return slow requests

        Returns:
            List of RequestTrace objects
        """
        traces = list(self._traces)

        if slow_only:
            traces = [t for t in traces if t.is_slow]

        return traces[-limit:]

    def get_latency_breakdown(self) -> LatencyBreakdown:
        """
        Get average latency breakdown.

        Returns:
            LatencyBreakdown with average times
        """
        if not self._traces:
            return LatencyBreakdown(0, 0, 0, 0)

        recent = list(self._traces)[-100:]

        return LatencyBreakdown(
            queue_ms=statistics.mean(t.queue_time_ms for t in recent),
            prefill_ms=statistics.mean(t.prefill_time_ms for t in recent),
            decode_ms=statistics.mean(t.decode_time_ms for t in recent),
            total_ms=statistics.mean(t.total_time_ms for t in recent),
        )

    def correlate_with_gpu_pressure(self, trace: RequestTrace) -> TraceCorrelation:
        """
        Correlate trace latency with GPU memory pressure.

        Args:
            trace: Request trace to analyze

        Returns:
            TraceCorrelation analysis
        """
        memory_delta = trace.gpu_memory_at_end - trace.gpu_memory_at_start

        # Determine likely cause based on patterns
        if memory_delta > 2.0:
            cause = "batch_contention"
        elif trace.queue_time_ms > trace.total_time_ms * 0.3:
            cause = "queue_congestion"
        elif trace.prefill_time_ms > trace.decode_time_ms * 2:
            cause = "long_prompt"
        else:
            cause = "normal"

        return TraceCorrelation(
            memory_pressure=memory_delta > 1.0,
            likely_cause=cause,
            memory_delta_gb=memory_delta,
        )

    def get_percentiles(self) -> Dict[str, float]:
        """
        Get latency percentiles.

        Returns:
            Dictionary with P50, P95, P99 values
        """
        if not self._latency_window:
            return {"p50": 0, "p95": 0, "p99": 0}

        sorted_latencies = sorted(self._latency_window)
        n = len(sorted_latencies)

        return {
            "p50": sorted_latencies[int(n * 0.50)],
            "p95": sorted_latencies[int(n * 0.95)],
            "p99": sorted_latencies[min(int(n * 0.99), n - 1)],
        }

    def get_stats(self) -> Dict[str, Any]:
        """
        Get comprehensive statistics.

        Returns:
            Dictionary with various stats
        """
        if not self._traces:
            return {
                "total_requests": 0,
                "slow_requests": 0,
                "avg_latency_ms": 0,
                "percentiles": {"p50": 0, "p95": 0, "p99": 0},
                "breakdown": {"queue": 0, "prefill": 0, "decode": 0},
            }

        traces = list(self._traces)
        slow_count = sum(1 for t in traces if t.is_slow)
        breakdown = self.get_latency_breakdown()

        return {
            "total_requests": len(traces),
            "slow_requests": slow_count,
            "slow_rate_percent": (slow_count / len(traces)) * 100,
            "avg_latency_ms": breakdown.total_ms,
            "percentiles": self.get_percentiles(),
            "breakdown": breakdown.as_dict,
            "baseline_p95": self._baseline_p95,
        }

    def clear(self) -> None:
        """Clear all traces."""
        self._traces.clear()
        self._latency_window.clear()
        self._baseline_p95 = None
        self._slow_threshold_ms = None
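The tracer's rolling baseline — a fixed-size `deque`, nearest-rank P95, slow threshold at 1.5× P95 — can be reproduced standalone. The latencies below are synthetic stand-ins for real traces:

```python
from collections import deque

# Rolling window, same shape as RequestTracer._latency_window
window = deque(maxlen=100)

# 19 fast requests plus one outlier
for latency_ms in [100.0] * 19 + [400.0]:
    window.append(latency_ms)

sorted_latencies = sorted(window)
p95_idx = int(len(sorted_latencies) * 0.95)   # nearest-rank index
baseline_p95 = sorted_latencies[p95_idx]
slow_threshold = baseline_p95 * 1.5           # same 1.5x rule as the tracer

print(baseline_p95)    # the outlier sits at index 19 -> 400.0
print(slow_threshold)  # 600.0
```

Because the threshold chases the window's own P95, a sustained slowdown raises the baseline and stops flagging requests; that is a deliberate trade-off of this adaptive scheme versus a fixed SLO threshold.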
storage/__init__.py ADDED
@@ -0,0 +1,11 @@
"""Storage layer for persistent metrics and traces."""

from .database import MetricsDB
from .models import MetricRecord, AlertRecord, RequestTrace

__all__ = [
    "MetricsDB",
    "MetricRecord",
    "AlertRecord",
    "RequestTrace",
]
storage/database.py ADDED
@@ -0,0 +1,448 @@
"""SQLite database operations for metrics storage."""

import sqlite3
import json
import logging
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Optional, Dict, Any
from contextlib import contextmanager

from .models import MetricRecord, AlertRecord, RequestTrace, LoadTestResult

logger = logging.getLogger(__name__)


class MetricsDB:
    """SQLite database for storing metrics, alerts, and traces."""

    def __init__(self, db_path: str = "data/metrics.db"):
        """
        Initialize database connection.

        Args:
            db_path: Path to SQLite database file
        """
        self.db_path = db_path
        self._ensure_directory()
        self._init_schema()

    def _ensure_directory(self) -> None:
        """Ensure the database directory exists."""
        Path(self.db_path).parent.mkdir(parents=True, exist_ok=True)

    @contextmanager
    def _get_connection(self):
        """Get a database connection with context manager."""
        conn = sqlite3.connect(self.db_path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()

    def _init_schema(self) -> None:
        """Initialize database schema."""
        with self._get_connection() as conn:
            cursor = conn.cursor()

            # Metrics table
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS metrics (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                    metric_name TEXT NOT NULL,
                    value REAL NOT NULL,
                    labels TEXT
                )
            """)

            # Indexes for metrics
            cursor.execute("""
                CREATE INDEX IF NOT EXISTS idx_metrics_timestamp
                ON metrics(timestamp)
            """)
            cursor.execute("""
                CREATE INDEX IF NOT EXISTS idx_metrics_name_time
                ON metrics(metric_name, timestamp)
            """)

            # Alerts table
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS alerts (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                    rule_name TEXT,
                    severity TEXT,
                    metric_name TEXT,
                    value REAL,
                    threshold REAL,
                    message TEXT,
                    resolved_at DATETIME
                )
            """)

            # Request traces table
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS request_traces (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    request_id TEXT UNIQUE,
                    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                    prompt_tokens INTEGER,
                    output_tokens INTEGER,
                    queue_time_ms REAL,
                    prefill_time_ms REAL,
                    decode_time_ms REAL,
                    total_time_ms REAL,
                    tokens_per_second REAL,
                    is_slow BOOLEAN
                )
            """)

            # Load test results table
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS load_tests (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    test_id TEXT UNIQUE,
                    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                    target_endpoint TEXT,
                    concurrent_users INTEGER,
                    requests_per_second REAL,
                    duration_seconds INTEGER,
                    total_requests INTEGER,
                    successful_requests INTEGER,
                    failed_requests INTEGER,
                    avg_latency_ms REAL,
                    p50_latency_ms REAL,
                    p95_latency_ms REAL,
                    p99_latency_ms REAL,
                    throughput_rps REAL,
                    saturation_point REAL
                )
            """)

    # Metrics operations

    def insert_metric(self, record: MetricRecord) -> int:
        """Insert a metric record."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                INSERT INTO metrics (timestamp, metric_name, value, labels)
                VALUES (?, ?, ?, ?)
                """,
                (
                    record.timestamp.isoformat(),
                    record.metric_name,
                    record.value,
                    json.dumps(record.labels) if record.labels else None,
                ),
            )
            return cursor.lastrowid

    def insert_metrics_batch(self, records: List[MetricRecord]) -> None:
        """Insert multiple metric records efficiently."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.executemany(
                """
                INSERT INTO metrics (timestamp, metric_name, value, labels)
                VALUES (?, ?, ?, ?)
                """,
                [
                    (
                        r.timestamp.isoformat(),
                        r.metric_name,
                        r.value,
                        json.dumps(r.labels) if r.labels else None,
                    )
                    for r in records
                ],
            )

    def query_metrics(
        self,
        metric_name: str,
        start: datetime,
        end: datetime,
        labels: Optional[Dict[str, str]] = None,
    ) -> List[MetricRecord]:
        """Query metrics by name and time range."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                SELECT id, timestamp, metric_name, value, labels
                FROM metrics
                WHERE metric_name = ? AND timestamp BETWEEN ? AND ?
                ORDER BY timestamp
                """,
                (metric_name, start.isoformat(), end.isoformat()),
            )

            records = []
            for row in cursor.fetchall():
                record = MetricRecord.from_row(tuple(row))
                if labels:
                    # Filter by labels if specified
                    if all(record.labels.get(k) == v for k, v in labels.items()):
                        records.append(record)
                else:
                    records.append(record)

            return records

    def query_aggregated(
        self,
        metric_name: str,
        start: datetime,
        end: datetime,
        aggregation: str = "avg",
        bucket_minutes: int = 1,
    ) -> List[Dict[str, Any]]:
        """Query metrics with time bucketing and aggregation."""
        agg_func = {
            "avg": "AVG",
            "max": "MAX",
            "min": "MIN",
            "sum": "SUM",
            "count": "COUNT",
        }.get(aggregation, "AVG")

        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                f"""
                SELECT
                    datetime(
                        strftime('%Y-%m-%d %H:', timestamp) ||
                        printf('%02d', (CAST(strftime('%M', timestamp) AS INTEGER) / {bucket_minutes}) * {bucket_minutes}) ||
                        ':00'
                    ) as bucket,
                    {agg_func}(value) as value
                FROM metrics
                WHERE metric_name = ? AND timestamp BETWEEN ? AND ?
                GROUP BY bucket
                ORDER BY bucket
                """,
                (metric_name, start.isoformat(), end.isoformat()),
            )

            return [
                {"time": row["bucket"], "value": row["value"]}
                for row in cursor.fetchall()
            ]

    # Alert operations

    def insert_alert(self, alert: AlertRecord) -> int:
        """Insert an alert record."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                INSERT INTO alerts
                (timestamp, rule_name, severity, metric_name, value, threshold, message, resolved_at)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    alert.timestamp.isoformat(),
                    alert.rule_name,
                    alert.severity,
                    alert.metric_name,
                    alert.value,
                    alert.threshold,
                    alert.message,
                    alert.resolved_at.isoformat() if alert.resolved_at else None,
                ),
            )
            return cursor.lastrowid

    def get_active_alerts(self) -> List[AlertRecord]:
        """Get all unresolved alerts."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                SELECT id, timestamp, rule_name, severity, metric_name, value, threshold, message, resolved_at
                FROM alerts
                WHERE resolved_at IS NULL
                ORDER BY timestamp DESC
                """
            )
            return [AlertRecord.from_row(tuple(row)) for row in cursor.fetchall()]

    def get_recent_alerts(self, limit: int = 100) -> List[AlertRecord]:
        """Get recent alerts."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                SELECT id, timestamp, rule_name, severity, metric_name, value, threshold, message, resolved_at
                FROM alerts
                ORDER BY timestamp DESC
                LIMIT ?
                """,
                (limit,),
            )
            return [AlertRecord.from_row(tuple(row)) for row in cursor.fetchall()]

    def resolve_alert(self, alert_id: int) -> None:
        """Mark an alert as resolved."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                UPDATE alerts SET resolved_at = ? WHERE id = ?
                """,
                (datetime.now().isoformat(), alert_id),
            )

    # Request trace operations

    def insert_trace(self, trace: RequestTrace) -> int:
        """Insert a request trace."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                INSERT OR REPLACE INTO request_traces
                (request_id, timestamp, prompt_tokens, output_tokens,
                 queue_time_ms, prefill_time_ms, decode_time_ms, total_time_ms,
                 tokens_per_second, is_slow)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    trace.request_id,
                    trace.timestamp.isoformat(),
                    trace.prompt_tokens,
                    trace.output_tokens,
                    trace.queue_time_ms,
                    trace.prefill_time_ms,
                    trace.decode_time_ms,
                    trace.total_time_ms,
                    trace.tokens_per_second,
                    trace.is_slow,
                ),
            )
            return cursor.lastrowid

    def get_recent_traces(
        self, limit: int = 100, slow_only: bool = False
    ) -> List[RequestTrace]:
        """Get recent request traces."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            query = """
                SELECT id, request_id, timestamp, prompt_tokens, output_tokens,
                       queue_time_ms, prefill_time_ms, decode_time_ms, total_time_ms,
                       tokens_per_second, is_slow
                FROM request_traces
            """
            if slow_only:
                query += " WHERE is_slow = 1"
            query += " ORDER BY timestamp DESC LIMIT ?"

            cursor.execute(query, (limit,))
            return [RequestTrace.from_row(tuple(row)) for row in cursor.fetchall()]

    def get_trace_stats(self) -> Dict[str, Any]:
        """Get aggregate statistics for traces."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                SELECT
                    COUNT(*) as total,
                    AVG(total_time_ms) as avg_latency,
                    AVG(queue_time_ms) as avg_queue,
                    AVG(prefill_time_ms) as avg_prefill,
                    AVG(decode_time_ms) as avg_decode,
                    SUM(CASE WHEN is_slow THEN 1 ELSE 0 END) as slow_count
                FROM request_traces
                WHERE timestamp > datetime('now', '-1 hour')
                """
            )
            row = cursor.fetchone()
            return {
                "total_requests": row["total"] or 0,
                "avg_latency_ms": row["avg_latency"] or 0,
                "avg_queue_ms": row["avg_queue"] or 0,
                "avg_prefill_ms": row["avg_prefill"] or 0,
                "avg_decode_ms": row["avg_decode"] or 0,
                "slow_request_count": row["slow_count"] or 0,
            }

    # Load test operations

    def insert_load_test(self, result: LoadTestResult) -> int:
        """Insert a load test result."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                INSERT INTO load_tests
                (test_id, timestamp, target_endpoint, concurrent_users,
                 requests_per_second, duration_seconds, total_requests,
                 successful_requests, failed_requests, avg_latency_ms,
                 p50_latency_ms, p95_latency_ms, p99_latency_ms,
                 throughput_rps, saturation_point)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    result.test_id,
                    result.timestamp.isoformat(),
                    result.target_endpoint,
                    result.concurrent_users,
                    result.requests_per_second,
                    result.duration_seconds,
                    result.total_requests,
                    result.successful_requests,
                    result.failed_requests,
                    result.avg_latency_ms,
                    result.p50_latency_ms,
                    result.p95_latency_ms,
                    result.p99_latency_ms,
                    result.throughput_rps,
                    result.saturation_point,
                ),
            )
            return cursor.lastrowid

    def get_recent_load_tests(self, limit: int = 10) -> List[Dict[str, Any]]:
        """Get recent load test results."""
        with self._get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                SELECT * FROM load_tests
                ORDER BY timestamp DESC
                LIMIT ?
                """,
                (limit,),
            )
            return [dict(row) for row in cursor.fetchall()]

    # Cleanup operations

    def cleanup_old_data(self, days: int = 7) -> int:
        """Remove data older than specified days."""
        cutoff = (datetime.now() - timedelta(days=days)).isoformat()

        with self._get_connection() as conn:
            cursor = conn.cursor()
            total_deleted = 0

            for table in ["metrics", "alerts", "request_traces"]:
                cursor.execute(
                    f"DELETE FROM {table} WHERE timestamp < ?",
                    (cutoff,),
                )
                total_deleted += cursor.rowcount

            return total_deleted
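The minute-bucketing trick in `query_aggregated` (floor the minute field with integer division, then `GROUP BY`) works in plain SQLite; here is a self-contained sketch against an in-memory database with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (timestamp TEXT, metric_name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [
        ("2024-01-01 10:00:10", "gpu_util", 40.0),
        ("2024-01-01 10:03:50", "gpu_util", 60.0),
        ("2024-01-01 10:07:20", "gpu_util", 80.0),
    ],
)

bucket_minutes = 5
# Floor each row's minute to its 5-minute bucket, then average per bucket
cur = conn.execute(f"""
    SELECT
        strftime('%Y-%m-%d %H:', timestamp) ||
        printf('%02d', (CAST(strftime('%M', timestamp) AS INTEGER) / {bucket_minutes}) * {bucket_minutes}) AS bucket,
        AVG(value) AS value
    FROM metrics
    WHERE metric_name = 'gpu_util'
    GROUP BY bucket
    ORDER BY bucket
""")
print(cur.fetchall())  # [('2024-01-01 10:00', 50.0), ('2024-01-01 10:05', 80.0)]
```

Minutes 00 and 03 collapse into the 10:00 bucket (average 50.0) while minute 07 lands in 10:05; the dashboard's version additionally wraps the bucket string in `datetime(... || ':00')` to normalize it back to a full timestamp.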
storage/models.py ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Data models for storage layer."""
2
+
3
+ from dataclasses import dataclass, field
4
+ from datetime import datetime
5
+ from typing import Optional, Dict, Any
6
+ import json
7
+
8
+
9
+ @dataclass
10
+ class MetricRecord:
11
+ """A single metric record for storage."""
12
+ metric_name: str
13
+ value: float
14
+ timestamp: datetime = field(default_factory=datetime.now)
15
+ labels: Dict[str, str] = field(default_factory=dict)
16
+ id: Optional[int] = None
17
+
18
+ def to_dict(self) -> Dict[str, Any]:
19
+ return {
20
+ "id": self.id,
21
+ "metric_name": self.metric_name,
22
+ "value": self.value,
23
+ "timestamp": self.timestamp.isoformat(),
24
+ "labels": self.labels,
25
+ }
26
+
27
+ @classmethod
28
+ def from_row(cls, row: tuple) -> "MetricRecord":
29
+ return cls(
30
+ id=row[0],
31
+ timestamp=datetime.fromisoformat(row[1]),
32
+ metric_name=row[2],
33
+ value=row[3],
34
+ labels=json.loads(row[4]) if row[4] else {},
35
+ )
36
+
37
+
38
+ @dataclass
39
+ class AlertRecord:
40
+ """An alert record for storage."""
41
+ rule_name: str
42
+ severity: str
43
+ metric_name: str
44
+ value: float
45
+ threshold: float
46
+ message: str
47
+ timestamp: datetime = field(default_factory=datetime.now)
48
+ resolved_at: Optional[datetime] = None
49
+ id: Optional[int] = None
50
+
51
+ def to_dict(self) -> Dict[str, Any]:
52
+ return {
53
+ "id": self.id,
54
+ "rule_name": self.rule_name,
55
+ "severity": self.severity,
56
+ "metric_name": self.metric_name,
57
+ "value": self.value,
58
+ "threshold": self.threshold,
59
+ "message": self.message,
60
+ "timestamp": self.timestamp.isoformat(),
61
+ "resolved_at": self.resolved_at.isoformat() if self.resolved_at else None,
62
+ }
63
+
64
+ @classmethod
65
+ def from_row(cls, row: tuple) -> "AlertRecord":
66
+ return cls(
67
+ id=row[0],
68
+ timestamp=datetime.fromisoformat(row[1]),
69
+ rule_name=row[2],
70
+ severity=row[3],
71
+ metric_name=row[4],
72
+ value=row[5],
73
+ threshold=row[6],
74
+ message=row[7] if len(row) > 7 else "",
75
+ resolved_at=datetime.fromisoformat(row[8]) if len(row) > 8 and row[8] else None,
76
+ )
77
+
78
+
79
+ @dataclass
80
+ class RequestTrace:
81
+ """A request trace for latency analysis."""
82
+ request_id: str
83
+ prompt_tokens: int
84
+ output_tokens: int
85
+ queue_time_ms: float
86
+ prefill_time_ms: float
87
+ decode_time_ms: float
88
+ total_time_ms: float
89
+     tokens_per_second: float
+     gpu_memory_at_start: float = 0.0
+     gpu_memory_at_end: float = 0.0
+     is_slow: bool = False
+     timestamp: datetime = field(default_factory=datetime.now)
+     id: Optional[int] = None
+
+     def to_dict(self) -> Dict[str, Any]:
+         return {
+             "id": self.id,
+             "request_id": self.request_id,
+             "timestamp": self.timestamp.isoformat(),
+             "prompt_tokens": self.prompt_tokens,
+             "output_tokens": self.output_tokens,
+             "queue_time_ms": round(self.queue_time_ms, 2),
+             "prefill_time_ms": round(self.prefill_time_ms, 2),
+             "decode_time_ms": round(self.decode_time_ms, 2),
+             "total_time_ms": round(self.total_time_ms, 2),
+             "tokens_per_second": round(self.tokens_per_second, 2),
+             "is_slow": self.is_slow,
+         }
+
+     @classmethod
+     def from_row(cls, row: tuple) -> "RequestTrace":
+         return cls(
+             id=row[0],
+             request_id=row[1],
+             timestamp=datetime.fromisoformat(row[2]),
+             prompt_tokens=row[3],
+             output_tokens=row[4],
+             queue_time_ms=row[5],
+             prefill_time_ms=row[6],
+             decode_time_ms=row[7],
+             total_time_ms=row[8],
+             tokens_per_second=row[9] if len(row) > 9 else 0,
+             is_slow=bool(row[10]) if len(row) > 10 else False,
+         )
+
+
+ @dataclass
+ class LoadTestResult:
+     """Results from a load test run."""
+     test_id: str
+     target_endpoint: str
+     concurrent_users: int
+     requests_per_second: float
+     duration_seconds: int
+     total_requests: int
+     successful_requests: int
+     failed_requests: int
+     avg_latency_ms: float
+     p50_latency_ms: float
+     p95_latency_ms: float
+     p99_latency_ms: float
+     throughput_rps: float
+     saturation_point: Optional[float] = None
+     timestamp: datetime = field(default_factory=datetime.now)
+     id: Optional[int] = None
+
+     def to_dict(self) -> Dict[str, Any]:
+         return {
+             "test_id": self.test_id,
+             "target_endpoint": self.target_endpoint,
+             "concurrent_users": self.concurrent_users,
+             "requests_per_second": self.requests_per_second,
+             "duration_seconds": self.duration_seconds,
+             "total_requests": self.total_requests,
+             "successful_requests": self.successful_requests,
+             "failed_requests": self.failed_requests,
+             "avg_latency_ms": round(self.avg_latency_ms, 2),
+             "p50_latency_ms": round(self.p50_latency_ms, 2),
+             "p95_latency_ms": round(self.p95_latency_ms, 2),
+             "p99_latency_ms": round(self.p99_latency_ms, 2),
+             "throughput_rps": round(self.throughput_rps, 2),
+             "saturation_point": self.saturation_point,
+             "timestamp": self.timestamp.isoformat(),
+         }
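For context, here is a hypothetical standard-library sketch of how the percentile fields stored on `LoadTestResult` could be derived from raw per-request latencies. The sample data and variable names are illustrative only, not the repo's actual load-test code:

```python
import statistics

# Hypothetical per-request latencies collected during a load test run (ms)
latencies_ms = [12.0, 15.0, 14.0, 200.0, 16.0, 13.0, 15.5, 14.5, 13.5, 16.5]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points
cuts = statistics.quantiles(latencies_ms, n=100)
p50_latency_ms = cuts[49]
p95_latency_ms = cuts[94]
p99_latency_ms = cuts[98]
avg_latency_ms = statistics.fmean(latencies_ms)
```

The single 200 ms outlier barely moves p50 but dominates p99, which is why the dashboard stores all three percentiles rather than just the average.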
utils/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """Utility modules for the dashboard."""
+
+ from .prometheus_parser import parse_prometheus_metrics
+ from .history import MetricHistory
+
+ __all__ = ["parse_prometheus_metrics", "MetricHistory"]
utils/history.py ADDED
@@ -0,0 +1,163 @@
+ """In-memory metric history buffer for time-series data."""
+
+ from collections import deque
+ from dataclasses import dataclass, field
+ from datetime import datetime
+ from typing import Dict, List, Any, Optional
+ import threading
+
+
+ @dataclass
+ class HistoryPoint:
+     """A single point in metric history."""
+     timestamp: datetime
+     value: float
+     labels: Dict[str, str] = field(default_factory=dict)
+
+
+ class MetricHistory:
+     """
+     Thread-safe in-memory buffer for metric history.
+
+     Maintains a rolling window of metric values for charting.
+     """
+
+     def __init__(self, max_length: int = 300):
+         """
+         Initialize history buffer.
+
+         Args:
+             max_length: Maximum number of points to retain per series
+         """
+         self.max_length = max_length
+         self._data: Dict[str, deque] = {}
+         self._lock = threading.Lock()
+
+     def add(self, metric_name: str, value: float, labels: Optional[Dict[str, str]] = None) -> None:
+         """
+         Add a data point to the history.
+
+         Args:
+             metric_name: Name of the metric
+             value: Metric value
+             labels: Optional labels for the metric
+         """
+         point = HistoryPoint(
+             timestamp=datetime.now(),
+             value=value,
+             labels=labels or {}
+         )
+
+         # Create key including labels for differentiation
+         key = self._make_key(metric_name, labels)
+
+         with self._lock:
+             if key not in self._data:
+                 self._data[key] = deque(maxlen=self.max_length)
+             self._data[key].append(point)
+
+     def get(
+         self,
+         metric_name: str,
+         labels: Optional[Dict[str, str]] = None,
+         limit: Optional[int] = None
+     ) -> List[HistoryPoint]:
+         """
+         Get history for a metric.
+
+         Args:
+             metric_name: Name of the metric
+             labels: Optional label filter
+             limit: Maximum number of points to return
+
+         Returns:
+             List of history points
+         """
+         key = self._make_key(metric_name, labels)
+
+         with self._lock:
+             if key not in self._data:
+                 return []
+
+             points = list(self._data[key])
+             if limit:
+                 points = points[-limit:]
+             return points
+
+     def get_latest(
+         self,
+         metric_name: str,
+         labels: Optional[Dict[str, str]] = None
+     ) -> Optional[HistoryPoint]:
+         """Get the most recent value for a metric."""
+         points = self.get(metric_name, labels, limit=1)
+         return points[-1] if points else None
+
+     def get_all_series(self, metric_name: str) -> Dict[str, List[HistoryPoint]]:
+         """
+         Get all label combinations for a metric.
+
+         Args:
+             metric_name: Base metric name
+
+         Returns:
+             Dictionary mapping label strings to history lists
+         """
+         result = {}
+         prefix = f"{metric_name}:"
+
+         with self._lock:
+             for key, points in self._data.items():
+                 if key == metric_name or key.startswith(prefix):
+                     result[key] = list(points)
+
+         return result
+
+     def to_dataframe(self, metric_name: str, labels: Optional[Dict[str, str]] = None):
+         """
+         Convert history to a pandas DataFrame.
+
+         Args:
+             metric_name: Name of the metric
+             labels: Optional label filter
+
+         Returns:
+             pandas DataFrame with time and value columns
+         """
+         import pandas as pd
+
+         points = self.get(metric_name, labels)
+
+         if not points:
+             return pd.DataFrame(columns=["time", "value"])
+
+         return pd.DataFrame([
+             {"time": p.timestamp, "value": p.value, **p.labels}
+             for p in points
+         ])
+
+     def clear(self, metric_name: Optional[str] = None) -> None:
+         """
+         Clear history.
+
+         Args:
+             metric_name: If provided, clear only this metric; otherwise clear all
+         """
+         with self._lock:
+             if metric_name:
+                 keys_to_remove = [
+                     k for k in self._data.keys()
+                     if k == metric_name or k.startswith(f"{metric_name}:")
+                 ]
+                 for key in keys_to_remove:
+                     del self._data[key]
+             else:
+                 self._data.clear()
+
+     def _make_key(self, metric_name: str, labels: Optional[Dict[str, str]]) -> str:
+         """Create a unique key from metric name and labels."""
+         if not labels:
+             return metric_name
+
+         label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
+         return f"{metric_name}:{label_str}"
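As a standalone illustration of the series-key scheme `MetricHistory._make_key` uses (the metric and label names below are made up), labels are sorted before joining so that insertion order never produces two keys for the same series:

```python
def make_key(metric_name, labels=None):
    # Mirrors MetricHistory._make_key: "name" for an unlabeled series,
    # "name:k1=v1,k2=v2" with labels sorted by key otherwise
    if not labels:
        return metric_name
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{metric_name}:{label_str}"

# Sorting makes the key deterministic regardless of dict ordering
key_a = make_key("gpu_utilization", {"rank": "1", "gpu": "0"})
key_b = make_key("gpu_utilization", {"gpu": "0", "rank": "1"})
# key_a == key_b == "gpu_utilization:gpu=0,rank=1"
```

This is also why `get_all_series` can recover every labeled variant of a metric just by matching the `"{metric_name}:"` prefix.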
utils/prometheus_parser.py ADDED
@@ -0,0 +1,195 @@
+ """Parser for Prometheus text format metrics."""
+
+ import re
+ from typing import Dict, List, Any, Optional
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class MetricSample:
+     """A single metric sample with labels and value."""
+     name: str
+     labels: Dict[str, str]
+     value: float
+     timestamp: Optional[float] = None
+
+
+ def parse_prometheus_metrics(text: str) -> Dict[str, List[MetricSample]]:
+     """
+     Parse Prometheus text format into structured metrics.
+
+     Args:
+         text: Raw Prometheus metrics text
+
+     Returns:
+         Dictionary mapping metric names to lists of samples
+     """
+     metrics: Dict[str, List[MetricSample]] = {}
+
+     for line in text.strip().split("\n"):
+         line = line.strip()
+
+         # Skip empty lines and comments
+         if not line or line.startswith("#"):
+             continue
+
+         # Parse metric line
+         sample = _parse_metric_line(line)
+         if sample:
+             if sample.name not in metrics:
+                 metrics[sample.name] = []
+             metrics[sample.name].append(sample)
+
+     return metrics
+
+
+ def _parse_metric_line(line: str) -> Optional[MetricSample]:
+     """Parse a single Prometheus metric line."""
+     # Pattern: metric_name{label1="value1",label2="value2"} value [timestamp]
+     # Or: metric_name value [timestamp]
+
+     # Match with labels (timestamps are int64 milliseconds, possibly negative)
+     match = re.match(
+         r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+([^\s]+)(?:\s+(-?\d+))?$',
+         line
+     )
+
+     if match:
+         name = match.group(1)
+         labels_str = match.group(2)
+         value_str = match.group(3)
+         timestamp_str = match.group(4)
+
+         labels = _parse_labels(labels_str)
+         value = _parse_value(value_str)
+         timestamp = float(timestamp_str) if timestamp_str else None
+
+         return MetricSample(name=name, labels=labels, value=value, timestamp=timestamp)
+
+     # Match without labels
+     match = re.match(
+         r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\s+([^\s]+)(?:\s+(-?\d+))?$',
+         line
+     )
+
+     if match:
+         name = match.group(1)
+         value_str = match.group(2)
+         timestamp_str = match.group(3)
+
+         value = _parse_value(value_str)
+         timestamp = float(timestamp_str) if timestamp_str else None
+
+         return MetricSample(name=name, labels={}, value=value, timestamp=timestamp)
+
+     return None
+
+
+ def _parse_labels(labels_str: str) -> Dict[str, str]:
+     """Parse label string into dictionary."""
+     labels = {}
+
+     # Pattern: key="value"
+     # Note: does not handle escaped quotes (\") inside label values
+     for match in re.finditer(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"', labels_str):
+         labels[match.group(1)] = match.group(2)
+
+     return labels
+
+
+ def _parse_value(value_str: str) -> float:
+     """Parse metric value, handling special cases."""
+     if value_str.lower() == "nan":
+         return float("nan")
+     if value_str.lower() == "+inf":
+         return float("inf")
+     if value_str.lower() == "-inf":
+         return float("-inf")
+     return float(value_str)
+
+
+ def get_metric_value(
+     metrics: Dict[str, List[MetricSample]],
+     name: str,
+     labels: Optional[Dict[str, str]] = None
+ ) -> Optional[float]:
+     """
+     Get a specific metric value by name and optional labels.
+
+     Args:
+         metrics: Parsed metrics dictionary
+         name: Metric name
+         labels: Optional label filter
+
+     Returns:
+         Metric value or None if not found
+     """
+     if name not in metrics:
+         return None
+
+     for sample in metrics[name]:
+         if labels is None:
+             return sample.value
+
+         # Check if all specified labels match
+         if all(sample.labels.get(k) == v for k, v in labels.items()):
+             return sample.value
+
+     return None
+
+
+ def get_histogram_quantile(
+     metrics: Dict[str, List[MetricSample]],
+     name: str,
+     quantile: float,
+     labels: Optional[Dict[str, str]] = None
+ ) -> Optional[float]:
+     """
+     Get an approximate quantile from a Prometheus histogram.
+
+     Args:
+         metrics: Parsed metrics dictionary
+         name: Base metric name (without _bucket suffix)
+         quantile: Desired quantile (e.g., 0.95 for P95)
+         labels: Optional label filter
+
+     Returns:
+         Approximate quantile value or None
+     """
+     bucket_name = f"{name}_bucket"
+     if bucket_name not in metrics:
+         return None
+
+     # Collect finite buckets; the +Inf bucket carries the total sample count
+     buckets = []
+     total = 0.0
+     for sample in metrics[bucket_name]:
+         if labels and not all(sample.labels.get(k) == v for k, v in labels.items()):
+             continue
+         le = sample.labels.get("le")
+         if le is None:
+             continue
+         if le == "+Inf":
+             total = sample.value
+         else:
+             buckets.append((float(le), sample.value))
+
+     if not buckets:
+         return None
+
+     # Sort by bucket boundary (bucket counts are cumulative)
+     buckets.sort(key=lambda x: x[0])
+
+     # Fall back to the largest finite bucket if the +Inf bucket was absent
+     if total == 0:
+         total = buckets[-1][1]
+     if total == 0:
+         return None
+
+     # Find the bucket containing the target quantile
+     target = quantile * total
+     prev_bound = 0.0
+     prev_count = 0.0
+
+     for bound, count in buckets:
+         if count >= target:
+             # Linear interpolation within the bucket
+             fraction = (target - prev_count) / (count - prev_count) if count > prev_count else 0
+             return prev_bound + fraction * (bound - prev_bound)
+         prev_bound = bound
+         prev_count = count
+
+     # Target lies beyond the largest finite bucket
+     return buckets[-1][0]
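A minimal standalone walk-through of the linear interpolation that `get_histogram_quantile` performs, using made-up cumulative bucket counts (boundary in seconds, cumulative count):

```python
# Made-up cumulative histogram: 40 requests <= 0.1s, 90 <= 0.5s, 100 <= 1.0s
buckets = [(0.1, 40.0), (0.5, 90.0), (1.0, 100.0)]
quantile = 0.95
total = buckets[-1][1]
target = quantile * total  # the 95th sample in cumulative order

prev_bound, prev_count = 0.0, 0.0
result = buckets[-1][0]
for bound, count in buckets:
    if count >= target:
        # Interpolate linearly inside the first bucket whose cumulative
        # count reaches the target
        fraction = (target - prev_count) / (count - prev_count)
        result = prev_bound + fraction * (bound - prev_bound)
        break
    prev_bound, prev_count = bound, count
# result == 0.75: the target sits halfway through the (0.5, 1.0] bucket
```

This assumes samples are spread uniformly within each bucket, so the answer is an approximation whose error shrinks as buckets get narrower.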