Spaces: DaCrow13/Hopcroft-Skill-Classification

Vito Proscia committed on Dec 22, 2025

Commit 4ba57df (unverified) · 2 Parent(s): 1708201 94af946

Merge pull request #32 from se4ai2526-uniba/grafana-drift-monitoring

Files changed (10)

docker-compose.yml +48 -0
monitoring/README.md +150 -0
monitoring/drift/scripts/prepare_baseline.py +119 -0
monitoring/drift/scripts/requirements.txt +6 -0
monitoring/drift/scripts/run_drift_check.py +216 -0
monitoring/grafana/dashboards/hopcroft_dashboard.json +358 -0
monitoring/grafana/provisioning/dashboards/dashboard.yml +13 -0
monitoring/grafana/provisioning/dashboards/hopcroft_dashboard.json +358 -0
monitoring/grafana/provisioning/datasources/prometheus.yml +14 -0
monitoring/prometheus/prometheus.yml +15 -0

docker-compose.yml CHANGED

@@ -72,6 +72,50 @@ services:
       - hopcroft-net
     restart: unless-stopped
+  grafana:
+    image: grafana/grafana:latest
+    container_name: grafana
+    ports:
+      - "3000:3000"
+    environment:
+      - GF_SECURITY_ADMIN_USER=admin
+      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_USERS_ALLOW_SIGN_UP=false
+      - GF_SERVER_ROOT_URL=http://localhost:3000
+    volumes:
+      # Provisioning: auto-configure datasources and dashboards
+      - ./monitoring/grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
+      - ./monitoring/grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
+      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
+      # Persistent storage for Grafana data
+      - grafana-data:/var/lib/grafana
+    networks:
+      - hopcroft-net
+    depends_on:
+      - prometheus
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+  pushgateway:
+    image: prom/pushgateway:latest
+    container_name: pushgateway
+    ports:
+      - "9091:9091"
+    networks:
+      - hopcroft-net
+    restart: unless-stopped
+    command:
+      - '--web.listen-address=:9091'
+      - '--persistence.file=/data/pushgateway.data'
+      - '--persistence.interval=5m'
+    volumes:
+      - pushgateway-data:/data
 networks:
   hopcroft-net:
     driver: bridge
@@ -79,3 +123,7 @@ networks:
 volumes:
   hopcroft-logs:
     driver: local
+  grafana-data:
+    driver: local
+  pushgateway-data:
+    driver: local
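
The Grafana `healthcheck` above probes `/api/health` every 30 seconds and gives up after 3 failed attempts. As a rough illustration of that retry pattern from the host side, here is a minimal sketch using only the Python standard library. The helper names `http_ok` and `wait_until_healthy` are hypothetical and not part of this commit, and the loop only approximates Docker's own health-state logic.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=10):
    """Return True when the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe, retries=3, interval=30.0, sleep=time.sleep):
    """Call probe() up to `retries` times, pausing `interval` seconds between tries."""
    for attempt in range(1, retries + 1):
        if probe():
            return True
        if attempt < retries:
            sleep(interval)  # no pause after the final failed attempt
    return False
```

With the stack running, a call such as `wait_until_healthy(lambda: http_ok("http://localhost:3000/api/health"))` would block until Grafana answers or the attempts run out. The `sleep` parameter is injectable so the loop can be exercised without real delays.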

monitoring/README.md CHANGED

@@ -55,3 +55,153 @@ We used Better Stack Uptime to monitor the availability of the production deploy
 - A failure scenario was tested to confirm Better Stack reports the server error details.
 - Screenshots are available in `monitoring/screenshots/`.

+---
+## Grafana Dashboard
+Grafana provides real-time visualization of system metrics and drift detection status.
+### Configuration
+- **Port**: `3000`
+- **Credentials**: `admin` / `admin`
+- **Dashboard**: Hopcroft Monitoring Dashboard
+- **Datasource**: Prometheus (auto-provisioned)
+- **Provisioning Files**:
+  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
+  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
+  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`
+### Dashboard Panels
+1. **API Request Rate**: Rate of incoming requests per endpoint
+2. **API Latency**: Median (p50) and 95th-percentile (p95) response time
+3. **Drift Detection Status**: Latest drift flag (0 = No Drift, 1 = Drift Detected)
+4. **Drift P-Value**: Smallest per-feature p-value from the latest drift check
+5. **Drift Distance**: Largest per-feature Kolmogorov-Smirnov statistic from the latest drift check
+### Access
+Navigate to `http://localhost:3000` and log in with the provided credentials. The dashboard refreshes every 10 seconds.
+---
+## Data Drift Detection
+Automated distribution shift detection using statistical testing to monitor model input data quality.
+### Algorithm
+- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
+- **Baseline Data**: 1000 samples from training set
+- **Detection Threshold**: smallest per-feature p-value < 0.05 / number of features (Bonferroni correction)
+- **Metrics Published**: drift_detected, drift_p_value, drift_distance, drift_check_timestamp
+### Scripts
+#### Baseline Preparation
+**Script**: `drift/scripts/prepare_baseline.py`
+Functionality:
+- Loads data from SQLite database (`data/raw/skillscope_data.db`)
+- Extracts numeric features only
+- Samples 1000 representative records
+- Saves to `drift/baseline/reference_data.pkl`
+Usage:
+```bash
+cd monitoring/drift/scripts
+python prepare_baseline.py
+```
+#### Drift Detection
+**Script**: `drift/scripts/run_drift_check.py`
+Functionality:
+- Loads baseline reference data
+- Loads new data from `data/test.csv` (first 500 rows), or simulates shifted data when that file is missing
+- Performs KS test on each feature
+- Pushes metrics to Pushgateway
+- Saves results to `drift/reports/`
+Usage:
+```bash
+cd monitoring/drift/scripts
+python run_drift_check.py
+```
+### Verification
+Check Pushgateway metrics:
+```bash
+curl http://localhost:9091/metrics | grep drift
+```
+Query in Prometheus:
+```promql
+drift_detected
+drift_p_value
+drift_distance
+```
+---
+## Pushgateway
+Pushgateway collects metrics from short-lived jobs such as the drift detection script.
+### Configuration
+- **Port**: `9091`
+- **Persistence**: Enabled with 5-minute intervals
+- **Data Volume**: `pushgateway-data`
+### Metrics Endpoint
+Access metrics at `http://localhost:9091/metrics`
+### Integration
+The drift detection script pushes its metrics to Pushgateway. Prometheus scrapes them from there, and Grafana displays them.
+---
+## Alerting
+Alert rules are defined in `prometheus/alert_rules.yml`:
+- **High Latency**: Triggered when average latency exceeds 2 seconds
+- **High Error Rate**: Triggered when error rate exceeds 5%
+- **Data Drift Detected**: Triggered when drift_detected = 1
+Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.
+---
+## Complete Stack Usage
+### Starting All Services
+```bash
+# Start all monitoring services
+docker compose up -d
+# Verify all containers are running
+docker compose ps
+# Check Prometheus targets
+curl http://localhost:9090/api/v1/targets
+# Check Grafana health
+curl http://localhost:3000/api/health
+```
+### Running Drift Detection Workflow
+1. **Prepare Baseline (One-time setup)**
+```bash
+cd monitoring/drift/scripts
+python prepare_baseline.py
+```
+2. **Execute Drift Check**
+```bash
+python run_drift_check.py
+```
+3. **Verify Results**
+   - Check Pushgateway: `http://localhost:9091`
+   - Check Prometheus: `http://localhost:9090/graph`
+   - Check Grafana: `http://localhost:3000`
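
The Algorithm section of the README describes a per-feature two-sample Kolmogorov-Smirnov test with a Bonferroni-adjusted threshold. Below is a minimal, dependency-free sketch of that decision rule. It is not the script's implementation, which calls `scipy.stats.ks_2samp`. The helper names `ks_statistic`, `bonferroni_threshold`, and `drifted_features` are hypothetical, and the p-values in the usage lines are made-up inputs for illustration.

```python
from bisect import bisect_right


def ks_statistic(baseline, new):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    base_sorted, new_sorted = sorted(baseline), sorted(new)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in base_sorted + new_sorted:
        cdf_base = bisect_right(base_sorted, x) / len(base_sorted)
        cdf_new = bisect_right(new_sorted, x) / len(new_sorted)
        largest_gap = max(largest_gap, abs(cdf_base - cdf_new))
    return largest_gap


def bonferroni_threshold(alpha, num_features):
    """Per-feature significance level after Bonferroni correction."""
    return alpha / num_features


def drifted_features(p_values, alpha=0.05):
    """Indices of features whose p-value falls below the adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical p-values for four features; only the second is below 0.05 / 4.
print(drifted_features([0.20, 0.001, 0.03, 0.60]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Comparing the smallest p-value against the adjusted threshold, as the drift script does, is equivalent to asking whether `drifted_features` returns a non-empty list. Returning the indices additionally shows which features triggered the flag.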

monitoring/drift/scripts/prepare_baseline.py ADDED

	@@ -0,0 +1,119 @@

+"""
+Prepare baseline/reference data for drift detection.
+This script samples representative data from the training set.
+"""
+import pickle
+import pandas as pd
+import numpy as np
+import sqlite3
+from pathlib import Path
+from sklearn.model_selection import train_test_split
+# Paths
+PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
+BASELINE_DIR = Path(__file__).parent.parent / "baseline"
+BASELINE_DIR.mkdir(parents=True, exist_ok=True)
+def load_training_data():
+    """Load the original training dataset from SQLite database."""
+    # Load from SQLite database
+    db_path = PROJECT_ROOT / "data" / "raw" / "skillscope_data.db"
+    if not db_path.exists():
+        raise FileNotFoundError(f"Database not found at {db_path}")
+    print(f"Loading data from database: {db_path}")
+    conn = sqlite3.connect(db_path)
+    try:
+        # Load from the main table; close the connection even if the query fails
+        query = "SELECT * FROM nlbse_tool_competition_data_by_issue LIMIT 10000"
+        df = pd.read_sql_query(query, conn)
+    finally:
+        conn.close()
+    print(f"Loaded {len(df)} training samples")
+    return df
+def prepare_baseline(df, sample_size=1000, random_state=42):
+    """
+    Sample representative baseline data.
+    Args:
+        df: Training dataframe
+        sample_size: Number of samples for baseline
+        random_state: Random seed for reproducibility
+    Returns:
+        Baseline dataframe
+    """
+    # Stratified sampling if you have labels
+    if 'label' in df.columns:
+        _, baseline_df = train_test_split(
+            df,
+            test_size=sample_size,
+            random_state=random_state,
+            stratify=df['label']
+        )
+    else:
+        baseline_df = df.sample(n=min(sample_size, len(df)), random_state=random_state)
+    print(f"Sampled {len(baseline_df)} baseline samples")
+    return baseline_df
+def extract_features(df):
+    """
+    Extract features used for drift detection.
+    Should match the features used by your model.
+    """
+    # Select only numeric columns, exclude labels and IDs
+    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
+    exclude_cols = ['label', 'id', 'timestamp', 'issue_id', 'file_id', 'method_id', 'class_id']
+    feature_columns = [col for col in numeric_cols if col not in exclude_cols]
+    X = df[feature_columns].values
+    print(f"Extracted {X.shape[1]} numeric features from {X.shape[0]} samples")
+    return X
+def save_baseline(baseline_data, filename="reference_data.pkl"):
+    """Save baseline data to disk."""
+    baseline_path = BASELINE_DIR / filename
+    with open(baseline_path, 'wb') as f:
+        pickle.dump(baseline_data, f)
+    print(f"Baseline saved to {baseline_path}")
+    print(f"   Shape: {baseline_data.shape}")
+    print(f"   Size: {baseline_path.stat().st_size / 1024:.2f} KB")
+def main():
+    """Main execution."""
+    print("=" * 60)
+    print("Preparing Baseline Data for Drift Detection")
+    print("=" * 60)
+    # Load data
+    df = load_training_data()
+    # Sample baseline
+    baseline_df = prepare_baseline(df, sample_size=1000)
+    # Extract features
+    X_baseline = extract_features(baseline_df)
+    # Save
+    save_baseline(X_baseline)
+    print("\n" + "=" * 60)
+    print("Baseline preparation complete!")
+    print("=" * 60)
+if __name__ == "__main__":
+    main()
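
`prepare_baseline` relies on `train_test_split(..., stratify=df['label'])` to keep label proportions in the sample. To make that idea concrete, here is a minimal standard-library sketch of proportional stratified sampling. It is an illustration only, not what scikit-learn does internally: the function name `stratified_sample` and the `label_of` callback are hypothetical, and because each per-label quota is rounded, the result can differ from `sample_size` by a few rows.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, sample_size, seed=42):
    """Draw roughly `sample_size` rows while preserving each label's share."""
    rng = random.Random(seed)  # local generator, so the draw is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    sample = []
    for members in groups.values():
        # Quota proportional to the label's share of the full dataset.
        quota = round(sample_size * len(members) / len(rows))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy data: 80 rows labelled "a" and 20 labelled "b".
rows = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
picked = stratified_sample(rows, label_of=lambda row: row[0], sample_size=10)
print(sum(1 for row in picked if row[0] == "a"), sum(1 for row in picked if row[0] == "b"))
```

The 80/20 split of the toy data carries over to the sample as 8 rows of "a" and 2 of "b".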

monitoring/drift/scripts/requirements.txt ADDED

	@@ -0,0 +1,6 @@

+alibi-detect>=0.11.4
+pandas>=2.0.0
+numpy>=1.24.0
+scikit-learn>=1.3.0
+requests>=2.31.0
+mlflow>=2.8.0

monitoring/drift/scripts/run_drift_check.py ADDED

	@@ -0,0 +1,216 @@

+"""
+Data Drift Detection using Scipy KS Test.
+Detects distribution shifts between baseline and new data.
+"""
+import pickle
+import json
+import requests
+import numpy as np
+import pandas as pd
+from pathlib import Path
+from datetime import datetime
+from scipy.stats import ks_2samp
+from typing import Dict
+# Configuration
+PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
+BASELINE_DIR = Path(__file__).parent.parent / "baseline"
+REPORTS_DIR = Path(__file__).parent.parent / "reports"
+REPORTS_DIR.mkdir(parents=True, exist_ok=True)
+PUSHGATEWAY_URL = "http://localhost:9091"
+P_VALUE_THRESHOLD = 0.05  # Significance level
+def load_baseline() -> np.ndarray:
+    """Load reference/baseline data."""
+    baseline_path = BASELINE_DIR / "reference_data.pkl"
+    if not baseline_path.exists():
+        raise FileNotFoundError(
+            f"Baseline data not found at {baseline_path}\n"
+            f"Run `python prepare_baseline.py` first!"
+        )
+    with open(baseline_path, 'rb') as f:
+        X_baseline = pickle.load(f)
+    print(f"Loaded baseline data: {X_baseline.shape}")
+    return X_baseline
+def load_new_data() -> np.ndarray:
+    """
+    Load new/production data to check for drift.
+    In production, this would fetch from:
+    - Database
+    - S3 bucket
+    - API logs
+    - Data lake
+    For now, simulate or load from file.
+    """
+    # Option 1: Load from file
+    data_path = PROJECT_ROOT / "data" / "test.csv"
+    if data_path.exists():
+        df = pd.read_csv(data_path)
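+        # Extract the same numeric features that prepare_baseline.py keeps
+        exclude_cols = ['label', 'id', 'timestamp', 'issue_id', 'file_id', 'method_id', 'class_id']
+        numeric_cols = df.select_dtypes(include=[np.number]).columns
+        feature_columns = [col for col in numeric_cols if col not in exclude_cols]
+        X_new = df[feature_columns].values[:500]  # Take 500 samples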
+        print(f"Loaded new data from file: {X_new.shape}")
+        return X_new
+    # Option 2: Simulate (for testing)
+    print("Simulating new data (no test file found)")
+    X_baseline = load_baseline()
+    # Add small Gaussian noise to simulate drift
+    X_sample = X_baseline[:500]  # May hold fewer than 500 rows
+    X_new = X_sample + np.random.normal(0, 0.1, X_sample.shape)
+    return X_new
+def run_drift_detection(X_baseline: np.ndarray, X_new: np.ndarray) -> Dict:
+    """
+    Run Kolmogorov-Smirnov drift detection using scipy.
+    Args:
+        X_baseline: Reference data
+        X_new: New data to check
+    Returns:
+        Drift detection results
+    """
+    print("\n" + "=" * 60)
+    print("Running Drift Detection (Kolmogorov-Smirnov Test)")
+    print("=" * 60)
+    # Run KS test for each feature
+    p_values = []
+    distances = []
+    for i in range(X_baseline.shape[1]):
+        statistic, p_value = ks_2samp(X_baseline[:, i], X_new[:, i])
+        p_values.append(p_value)
+        distances.append(statistic)
+    # Aggregate results
+    min_p_value = np.min(p_values)
+    max_distance = np.max(distances)
+    # Apply Bonferroni correction for multiple testing
+    adjusted_threshold = P_VALUE_THRESHOLD / X_baseline.shape[1]
+    drift_detected = min_p_value < adjusted_threshold
+    # Extract results
+    results = {
+        "timestamp": datetime.now().isoformat(),
+        "drift_detected": int(drift_detected),
+        "p_value": float(min_p_value),
+        "threshold": adjusted_threshold,
+        "distance": float(max_distance),
+        "baseline_samples": X_baseline.shape[0],
+        "new_samples": X_new.shape[0],
+        "num_features": X_baseline.shape[1]
+    }
+    # Print results
+    print(f"\nResults:")
+    print(f"   Drift Detected: {'YES' if results['drift_detected'] else 'NO'}")
+    print(f"   P-Value: {results['p_value']:.6f} (adjusted threshold: {adjusted_threshold:.6f})")
+    print(f"   Distance: {results['distance']:.6f}")
+    print(f"   Baseline: {X_baseline.shape[0]} samples")
+    print(f"   New Data: {X_new.shape[0]} samples")
+    return results
+def save_report(results: Dict):
+    """Save drift detection report to file."""
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    report_path = REPORTS_DIR / f"drift_report_{timestamp}.json"
+    with open(report_path, 'w') as f:
+        json.dump(results, f, indent=2)
+    print(f"\nReport saved to: {report_path}")
+def push_to_prometheus(results: Dict):
+    """
+    Push drift metrics to Prometheus via Pushgateway.
+    This allows Prometheus to scrape short-lived job metrics.
+    """
+    metrics = f"""# TYPE drift_detected gauge
+# HELP drift_detected Whether data drift was detected (1=yes, 0=no)
+drift_detected {results['drift_detected']}
+# TYPE drift_p_value gauge
+# HELP drift_p_value P-value from drift detection test
+drift_p_value {results['p_value']}
+# TYPE drift_distance gauge
+# HELP drift_distance Statistical distance between distributions
+drift_distance {results['distance']}
+# TYPE drift_check_timestamp gauge
+# HELP drift_check_timestamp Unix timestamp of last drift check
+drift_check_timestamp {datetime.now().timestamp()}
+"""
+    try:
+        response = requests.post(
+            f"{PUSHGATEWAY_URL}/metrics/job/drift_detection/instance/hopcroft",
+            data=metrics,
+            headers={'Content-Type': 'text/plain'},
+            timeout=10  # Fail fast instead of hanging if Pushgateway is unreachable
+        )
+        response.raise_for_status()
+        print(f"Metrics pushed to Pushgateway at {PUSHGATEWAY_URL}")
+    except requests.exceptions.RequestException as e:
+        print(f"Failed to push to Pushgateway: {e}")
+        print(f"   Make sure Pushgateway is running: docker compose ps pushgateway")
+def main():
+    """Main execution."""
+    print("\n" + "=" * 60)
+    print("Hopcroft Data Drift Detection")
+    print("=" * 60)
+    try:
+        # Load data
+        X_baseline = load_baseline()
+        X_new = load_new_data()
+        # Run drift detection
+        results = run_drift_detection(X_baseline, X_new)
+        # Save report
+        save_report(results)
+        # Push to Prometheus
+        push_to_prometheus(results)
+        print("\n" + "=" * 60)
+        print("Drift Detection Complete!")
+        print("=" * 60)
+        if results['drift_detected']:
+            print("\nWARNING: Data drift detected!")
+            print(f"   P-value: {results['p_value']:.6f} < {results['threshold']:.6f} (Bonferroni-adjusted)")
+            return 1
+        else:
+            print("\nNo significant drift detected")
+            return 0
+    except Exception as e:
+        print(f"\nError: {e}")
+        import traceback
+        traceback.print_exc()
+        return 1
+if __name__ == "__main__":
+    raise SystemExit(main())
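
`push_to_prometheus` assembles its request body as a single f-string in the Prometheus text exposition format. The sketch below builds an equivalent body from a small helper, which may be easier to extend when more gauges are added. The names `format_gauge` and `build_payload` are hypothetical and not part of this commit, the result values in the usage line are invented for illustration, and the `drift_check_timestamp` gauge is left out for brevity.

```python
def format_gauge(name, help_text, value):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def build_payload(results):
    """Join the drift gauges into a single request body for the Pushgateway."""
    gauges = [
        ("drift_detected", "Whether data drift was detected (1=yes, 0=no)",
         results["drift_detected"]),
        ("drift_p_value", "P-value from drift detection test", results["p_value"]),
        ("drift_distance", "Statistical distance between distributions",
         results["distance"]),
    ]
    return "".join(format_gauge(*gauge) for gauge in gauges)


# Hypothetical result values, used only to show the rendered text.
print(build_payload({"drift_detected": 1, "p_value": 0.0004, "distance": 0.31}))
```

Every line, including the last one, ends with a newline, which the text format requires.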

monitoring/grafana/dashboards/hopcroft_dashboard.json ADDED

	@@ -0,0 +1,358 @@

+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": "-- Grafana --",
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "gnetId": null,
+  "graphTooltip": 1,
+  "id": null,
+  "links": [],
+  "panels": [
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "reqps"
+        }
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 6,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "rate(fastapi_requests_total[1m])",
+          "refId": "A"
+        }
+      ],
+      "title": "Request Rate",
+      "type": "gauge",
+      "description": "Number of requests per second handled by the API"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": true
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "ms"
+        }
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 18,
+        "x": 6,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": ["mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "multi"
+        }
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.95, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
+          "legendFormat": "p95",
+          "refId": "A"
+        },
+        {
+          "expr": "histogram_quantile(0.50, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
+          "legendFormat": "p50 (median)",
+          "refId": "B"
+        }
+      ],
+      "title": "Request Latency (p50, p95)",
+      "type": "timeseries",
+      "description": "API response time percentiles over time"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [
+            {
+              "options": {
+                "0": {
+                  "color": "green",
+                  "index": 1,
+                  "text": "No Drift"
+                },
+                "1": {
+                  "color": "red",
+                  "index": 0,
+                  "text": "Drift Detected"
+                }
+              },
+              "type": "value"
+            }
+          ],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          }
+        }
+      },
+      "gridPos": {
+        "h": 6,
+        "w": 6,
+        "x": 0,
+        "y": 8
+      },
+      "id": 3,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true,
+        "text": {}
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "drift_detected",
+          "refId": "A"
+        }
+      ],
+      "title": "Data Drift Status",
+      "type": "stat",
+      "description": "Current data drift detection status (1 = drift detected, 0 = no drift)"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "decimals": 4,
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 0.01
+              },
+              {
+                "color": "green",
+                "value": 0.05
+              }
+            ]
+          },
+          "unit": "short"
+        }
+      },
+      "gridPos": {
+        "h": 6,
+        "w": 6,
+        "x": 6,
+        "y": 8
+      },
+      "id": 4,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true,
+        "text": {}
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "drift_p_value",
+          "refId": "A"
+        }
+      ],
+      "title": "Drift P-Value",
+      "type": "stat",
+      "description": "Statistical significance of detected drift (lower = more significant)"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "auto",
+            "spanNulls": false
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "short"
+        }
+      },
+      "gridPos": {
+        "h": 6,
+        "w": 12,
+        "x": 12,
+        "y": 8
+      },
+      "id": 5,
+      "options": {
+        "legend": {
+          "calcs": ["mean", "lastNotNull"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "multi"
+        }
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "drift_distance",
+          "legendFormat": "Distance",
+          "refId": "A"
+        }
+      ],
+      "title": "Drift Distance Over Time",
+      "type": "timeseries",
+      "description": "Statistical distance between baseline and current data distribution"
+    }
+  ],
+  "refresh": "10s",
+  "schemaVersion": 36,
+  "style": "dark",
+  "tags": ["hopcroft", "ml", "monitoring"],
+  "templating": {
+    "list": []
+  },
+  "time": {
+    "from": "now-1h",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "Hopcroft ML Model Monitoring",
+  "uid": "hopcroft-ml-dashboard",
+  "version": 1,
+  "weekStart": ""
+}
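
The dashboard only works if its panel queries match the metric names that the drift script pushes (`drift_detected`, `drift_p_value`, `drift_distance`). One way to check this is to list the PromQL expressions per panel and look for metrics that no panel mentions. The sketch below does that with hypothetical helpers `panel_queries` and `missing_metrics`. The inline JSON is a trimmed stand-in for the real file, and the substring match is deliberately crude, so it would also accept a longer metric name that merely contains the expected one.

```python
import json


def panel_queries(dashboard):
    """Map each panel title to the list of PromQL expressions it queries."""
    return {
        panel.get("title", ""): [
            target["expr"] for target in panel.get("targets", []) if "expr" in target
        ]
        for panel in dashboard.get("panels", [])
    }


def missing_metrics(dashboard, metric_names):
    """Return the metric names that no panel expression mentions."""
    expressions = [
        expr for exprs in panel_queries(dashboard).values() for expr in exprs
    ]
    return [
        name for name in metric_names
        if not any(name in expr for expr in expressions)
    ]


# A trimmed-down stand-in for the dashboard file, inline so the sketch runs as-is.
sample = json.loads("""
{"panels": [
  {"title": "Data Drift Status", "targets": [{"expr": "drift_detected", "refId": "A"}]},
  {"title": "Drift P-Value", "targets": [{"expr": "drift_p_value", "refId": "A"}]}
]}
""")
print(missing_metrics(sample, ["drift_detected", "drift_p_value", "drift_distance"]))
```

For the trimmed sample this reports `drift_distance` as missing. The full dashboard above includes a panel for it, so the same call on the real file would be expected to return an empty list.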

monitoring/grafana/provisioning/dashboards/dashboard.yml ADDED

	@@ -0,0 +1,13 @@

+apiVersion: 1
+providers:
+  - name: 'Hopcroft Dashboards'
+    orgId: 1
+    folder: ''
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 10
+    allowUiUpdates: true
+    options:
+      path: /var/lib/grafana/dashboards
+      foldersFromFilesStructure: true

monitoring/grafana/provisioning/dashboards/hopcroft_dashboard.json ADDED

	@@ -0,0 +1,358 @@

+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": "-- Grafana --",
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "gnetId": null,
+  "graphTooltip": 1,
+  "id": null,
+  "links": [],
+  "panels": [
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "reqps"
+        }
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 6,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "rate(fastapi_requests_total[1m])",
+          "refId": "A"
+        }
+      ],
+      "title": "Request Rate",
+      "type": "gauge",
+      "description": "Number of requests per second handled by the API"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": true
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "ms"
+        }
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 18,
+        "x": 6,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": ["mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "multi"
+        }
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.95, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
+          "legendFormat": "p95",
+          "refId": "A"
+        },
+        {
+          "expr": "histogram_quantile(0.50, rate(fastapi_request_duration_seconds_bucket[5m])) * 1000",
+          "legendFormat": "p50 (median)",
+          "refId": "B"
+        }
+      ],
+      "title": "Request Latency (p50, p95)",
+      "type": "timeseries",
+      "description": "API response time percentiles over time"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [
+            {
+              "options": {
+                "0": {
+                  "color": "green",
+                  "index": 1,
+                  "text": "No Drift"
+                },
+                "1": {
+                  "color": "red",
+                  "index": 0,
+                  "text": "Drift Detected"
+                }
+              },
+              "type": "value"
+            }
+          ],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          }
+        }
+      },
+      "gridPos": {
+        "h": 6,
+        "w": 6,
+        "x": 0,
+        "y": 8
+      },
+      "id": 3,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true,
+        "text": {}
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "drift_detected",
+          "refId": "A"
+        }
+      ],
+      "title": "Data Drift Status",
+      "type": "stat",
+      "description": "Current data drift detection status (1 = drift detected, 0 = no drift)"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "decimals": 4,
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 0.01
+              },
+              {
+                "color": "green",
+                "value": 0.05
+              }
+            ]
+          },
+          "unit": "short"
+        }
+      },
+      "gridPos": {
+        "h": 6,
+        "w": 6,
+        "x": 6,
+        "y": 8
+      },
+      "id": 4,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true,
+        "text": {}
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "drift_p_value",
+          "refId": "A"
+        }
+      ],
+      "title": "Drift P-Value",
+      "type": "stat",
+      "description": "Statistical significance of detected drift (lower = more significant)"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "auto",
+            "spanNulls": false
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "short"
+        }
+      },
+      "gridPos": {
+        "h": 6,
+        "w": 12,
+        "x": 12,
+        "y": 8
+      },
+      "id": 5,
+      "options": {
+        "legend": {
+          "calcs": ["mean", "lastNotNull"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "multi"
+        }
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "drift_distance",
+          "legendFormat": "Distance",
+          "refId": "A"
+        }
+      ],
+      "title": "Drift Distance Over Time",
+      "type": "timeseries",
+      "description": "Statistical distance between baseline and current data distribution"
+    }
+  ],
+  "refresh": "10s",
+  "schemaVersion": 36,
+  "style": "dark",
+  "tags": ["hopcroft", "ml", "monitoring"],
+  "templating": {
+    "list": []
+  },
+  "time": {
+    "from": "now-1h",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "Hopcroft ML Model Monitoring",
+  "uid": "hopcroft-ml-dashboard",
+  "version": 1,
+  "weekStart": ""
+}
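
The absolute threshold steps used by the stat panels above act as lower bounds: a reading takes the color of the highest step whose value it has reached, and the step with `"value": null` is the base that applies below every explicit bound. The sketch below reproduces that lookup in Python. The function name `threshold_color` and the `P_VALUE_STEPS` list are illustrative assumptions, not part of this commit; the list reuses the 0.01 and 0.05 boundaries of the "Drift P-Value" panel, colored so that small p-values are flagged, in line with the panel description.

```python
from typing import Optional, Sequence, Tuple

# Each step is (lower_bound, color). None marks the base step, mirroring
# "value": null in the dashboard JSON.
Step = Tuple[Optional[float], str]

# Illustrative steps (assumption): same 0.01 / 0.05 boundaries as the
# "Drift P-Value" panel, ordered by ascending lower bound.
P_VALUE_STEPS: Sequence[Step] = (
    (None, "red"),
    (0.01, "yellow"),
    (0.05, "green"),
)


def threshold_color(value: float, steps: Sequence[Step] = P_VALUE_STEPS) -> str:
    """Return the color of the last step whose lower bound is <= value.

    Assumes `steps` starts with the base step and is sorted ascending.
    """
    color = steps[0][1]  # base step applies when no bound is reached
    for bound, step_color in steps:
        if bound is None or value >= bound:
            color = step_color
    return color


if __name__ == "__main__":
    for p in (0.001, 0.03, 0.2):
        print(p, threshold_color(p))
```

Passing a different `steps` sequence makes it possible to check any other panel's bands before editing the JSON by hand.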

monitoring/grafana/provisioning/datasources/prometheus.yml ADDED

@@ -0,0 +1,14 @@

+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy
+    uid: prometheus
+    orgId: 1
+    url: http://prometheus:9090
+    isDefault: true
+    editable: true
+    jsonData:
+      httpMethod: POST
+      timeInterval: "15s"
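
The datasource above points Grafana at `http://prometheus:9090`. As a quick way to confirm that a series such as `drift_detected` is actually available there, the following sketch calls the Prometheus instant-query endpoint (`/api/v1/query`) using only the standard library. The helper names are hypothetical, and the hostname `prometheus` is assumed to resolve only from inside the `hopcroft-net` network.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

PROMETHEUS_URL = "http://prometheus:9090"  # same URL as the datasource file


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(expr: str, base_url: str = PROMETHEUS_URL, timeout: float = 5.0):
    """Run an instant query against a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    for labels, value in query("drift_detected"):
        print(labels, value)
```

An empty result list means Prometheus answered but holds no current sample for the expression, which usually points at the scrape side rather than at Grafana.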

monitoring/prometheus/prometheus.yml CHANGED

@@ -1,6 +1,9 @@
 global:
   scrape_interval: 15s
   evaluation_interval: 15s
 rule_files:
   - "alert_rules.yml"
@@ -13,5 +16,17 @@ alerting:
 scrape_configs:
   - job_name: 'hopcroft-api'
     static_configs:
       - targets: ['hopcroft-api:8080']

 global:
   scrape_interval: 15s
   evaluation_interval: 15s
+  external_labels:
+    monitor: 'hopcroft-monitor'
+    environment: 'development'
 rule_files:
   - "alert_rules.yml"
 scrape_configs:
   - job_name: 'hopcroft-api'
+    metrics_path: '/metrics'
     static_configs:
       - targets: ['hopcroft-api:8080']
+    scrape_interval: 10s
+  - job_name: 'prometheus'
+    static_configs:
+      - targets: ['localhost:9090']
+  - job_name: 'pushgateway'
+    honor_labels: true
+    static_configs:
+      - targets: ['pushgateway:9091']
+    scrape_interval: 30s
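
The `pushgateway` job above lets a short-lived batch job expose metrics by pushing them to the gateway, which Prometheus then scrapes every 30 seconds. The sketch below shows one way such a push could look with only the standard library: it renders gauges in the Prometheus text exposition format and sends them to the Pushgateway `/metrics/job/<job>` endpoint. The metric names come from the dashboard queries; the job name `drift_check` and the helper functions are assumptions for illustration and are not taken from `run_drift_check.py`.

```python
import urllib.request
from typing import Mapping

PUSHGATEWAY_URL = "http://pushgateway:9091"  # target from the scrape config
JOB_NAME = "drift_check"  # hypothetical grouping key, not from the commit


def render_metrics(metrics: Mapping[str, float]) -> bytes:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The format requires a trailing newline after the last sample.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_metrics(
    metrics: Mapping[str, float],
    base_url: str = PUSHGATEWAY_URL,
    job: str = JOB_NAME,
    timeout: float = 5.0,
) -> int:
    """PUT the metrics to the gateway, replacing the whole group for `job`."""
    request = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=render_metrics(metrics),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Example values only; a real run would take them from the drift test.
    push_metrics({"drift_detected": 1, "drift_p_value": 0.03, "drift_distance": 0.21})
```

Because the scrape job sets `honor_labels: true`, the `job` label pushed with the metrics is kept instead of being overwritten with `pushgateway`. Using `PUT` replaces every metric in the group on each run; `POST` would replace only metrics with matching names.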