Metrics Collection & Verification

This directory contains the configuration for Prometheus monitoring.

Configuration

  • Prometheus Config: prometheus/prometheus.yml
  • Scrape Target: hopcroft-api:8080
  • Metrics Endpoint: http://localhost:8080/metrics

Verification Queries (PromQL)

You can run these queries in the Prometheus Expression Browser (http://localhost:9090/graph):

1. Request Rate (Counter)

Shows the per-second request rate, averaged over the last minute.

rate(hopcroft_requests_total[1m])

2. Average Request Duration (Histogram)

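Calculates the average request latency over the last 5 minutes by dividing the rate of total time spent by the rate of requests handled.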

rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])

3. Current In-Progress Requests (Gauge)

Shows how many requests are currently being processed.

hopcroft_in_progress_requests

4. Model Prediction Time (Summary)

Shows the 90th percentile of model prediction time.

hopcroft_prediction_processing_seconds{quantile="0.9"}
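
The same queries can also be run from a script through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 as above and that the query returns an instant vector; the helper names are illustrative and not part of this repository.

```python
import json
import urllib.parse
import urllib.request

# Assumption: same host and port as the Expression Browser above.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_values(payload):
    """Return (labels, value) pairs from an instant-vector API response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    results = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        results.append((series["metric"], float(value)))
    return results


def run_query(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run a PromQL expression against a live Prometheus and parse the result."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=timeout) as resp:
        return extract_values(json.load(resp))
```

For example, `run_query("hopcroft_in_progress_requests")` would return one pair per series once the API is being scraped.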

Uptime Monitoring (Better Stack)

We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.

Base URL

Monitored endpoints

Checks and alerts

  • Monitors are configured to run from multiple locations.

  • Email notifications are enabled for failures.

  • A failure scenario was tested to confirm Better Stack reports the server error details.

  • Screenshots are available in monitoring/screenshots/.


Grafana Dashboard

Grafana provides real-time visualization of system metrics and drift detection status.

Configuration

  • Port: 3000
  • Credentials: admin / admin
  • Dashboard: Hopcroft Monitoring Dashboard
  • Datasource: Prometheus (auto-provisioned)
  • Provisioning Files:
    • Datasources: grafana/provisioning/datasources/prometheus.yml
    • Dashboards: grafana/provisioning/dashboards/dashboard.yml
    • Dashboard JSON: grafana/dashboards/hopcroft_dashboard.json

Dashboard Panels

  1. API Request Rate: Rate of incoming requests per endpoint
  2. API Latency: Average response time per endpoint
  3. Drift Detection Status: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
  4. Drift P-Value: Statistical significance of detected drift
  5. Drift Distance: Kolmogorov-Smirnov distance metric

Access

Navigate to http://localhost:3000 and log in with the credentials listed above. The dashboard refreshes every 10 seconds.


Data Drift Detection

Drift detection runs automatically and applies a statistical test to model input data to flag distribution shift relative to the training baseline.

Algorithm

  • Method: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
  • Baseline Data: 1000 samples from training set
  • Detection Threshold: p-value < 0.05 after Bonferroni correction for the number of features tested
  • Metrics Published: drift_detected, drift_p_value, drift_distance, drift_check_timestamp
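
To make the test and the correction concrete, here is a pure-Python sketch of the two-sample KS distance and a Bonferroni-adjusted decision. It is an illustration only, not the code in this repository: a scipy-based implementation would normally obtain both the statistic and the p-value from `scipy.stats.ks_2samp`, and the function names and example p-values below are hypothetical.

```python
from bisect import bisect_right


def ks_distance(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    distance = 0.0
    # Both empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        distance = max(distance, abs(cdf_ref - cdf_cur))
    return distance


def drifted_features(p_values, alpha=0.05):
    """Return features whose p-value is below alpha / number_of_features."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # Bonferroni-adjusted threshold
    return [name for name, p in p_values.items() if p < threshold]
```

With three features and alpha = 0.05, the per-feature threshold is 0.05 / 3 ≈ 0.0167, so a feature with p = 0.04 would not be flagged even though it is below 0.05.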

Scripts

Baseline Preparation

Script: drift/scripts/prepare_baseline.py

Functionality:

  • Loads data from SQLite database (data/raw/skillscope_data.db)
  • Extracts numeric features only
  • Samples 1000 representative records
  • Saves to drift/baseline/reference_data.pkl
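
The steps above could look roughly like the following sketch, which uses only the standard library. The table name passed to `load_numeric_rows`, the fixed seed, and the list-of-dicts layout of the pickle are assumptions made for illustration; the actual script may use a different schema and storage layout.

```python
import pickle
import random
import sqlite3


def load_numeric_rows(conn, table):
    """Read every row of `table`, keeping only int/float values per row."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [col[0] for col in cursor.description]
    rows = []
    for record in cursor:
        rows.append({
            name: value
            for name, value in zip(columns, record)
            if isinstance(value, (int, float))
        })
    return rows


def sample_baseline(rows, size=1000, seed=42):
    """Draw a reproducible random sample; return all rows if there are too few."""
    if len(rows) <= size:
        return list(rows)
    return random.Random(seed).sample(rows, size)


def save_baseline(rows, path):
    """Pickle the sampled rows to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(rows, handle)
```

A run would open the database with `sqlite3.connect("data/raw/skillscope_data.db")`, pass the connection and the relevant table name to `load_numeric_rows`, and save the sample to `drift/baseline/reference_data.pkl`.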

Usage:

cd monitoring/drift/scripts
python prepare_baseline.py

Drift Detection

Script: drift/scripts/run_drift_check.py

Functionality:

  • Loads baseline reference data
  • Compares with new production data
  • Performs KS test on each feature
  • Pushes metrics to Pushgateway
  • Saves results to drift/reports/

Usage:

cd monitoring/drift/scripts
python run_drift_check.py

Verification

Check Pushgateway metrics:

curl http://localhost:9091/metrics | grep drift

Query in Prometheus:

drift_detected
drift_p_value
drift_distance

Pushgateway

Pushgateway collects metrics from short-lived jobs such as the drift detection script.

Configuration

  • Port: 9091
  • Persistence: Enabled with 5-minute intervals
  • Data Volume: pushgateway-data

Metrics Endpoint

Access metrics at http://localhost:9091/metrics

Integration

The drift detection script pushes its metrics to Pushgateway. Prometheus then scrapes them from Pushgateway, and Grafana displays them.
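
As an illustration of the push step, the sketch below sends gauge values to Pushgateway using its HTTP API (`/metrics/job/<job>`) and the Prometheus text exposition format. The job name `drift_detection` and the helper names are hypothetical, and the actual script may instead rely on a client library.

```python
import urllib.request

PUSHGATEWAY_URL = "http://localhost:9091"  # port from the configuration above
JOB_NAME = "drift_detection"  # hypothetical job label, not taken from the repo


def format_metrics(metrics):
    """Render a name -> value mapping as gauges in the text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def push_metrics(metrics, base_url=PUSHGATEWAY_URL, job=JOB_NAME, timeout=5):
    """PUT the metrics to Pushgateway, replacing the job's previous group."""
    request = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=format_metrics(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Using PUT replaces every metric previously pushed under the same job, so each drift check overwrites the previous result rather than accumulating stale series.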


Alerting

Alert rules are defined in prometheus/alert_rules.yml:

  • High Latency: Triggered when average latency exceeds 2 seconds
  • High Error Rate: Triggered when error rate exceeds 5%
  • Data Drift Detected: Triggered when drift_detected = 1

Alerts are routed to Alertmanager (http://localhost:9093) and can be configured to send notifications via email, Slack, or other channels in alertmanager/config.yml.
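
To confirm that a rule is actually firing, the Prometheus HTTP API exposes active alerts at `/api/v1/alerts`. The sketch below lists the names of firing alerts; it assumes Prometheus at http://localhost:9090, and the alert names that appear depend on how the rules are named in prometheus/alert_rules.yml.

```python
import json
import urllib.request


def firing_alerts(payload):
    """Return the sorted, de-duplicated names of alerts in the 'firing' state."""
    alerts = payload["data"]["alerts"]
    return sorted({
        alert["labels"]["alertname"]
        for alert in alerts
        if alert["state"] == "firing"
    })


def fetch_firing_alerts(base_url="http://localhost:9090", timeout=5):
    """Query a live Prometheus for its active alerts and keep the firing ones."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as resp:
        return firing_alerts(json.load(resp))
```

Alerts in the `pending` state are excluded because their condition has not yet held for the rule's configured duration.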


Complete Stack Usage

Starting All Services

# Start all monitoring services
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check Grafana health
curl http://localhost:3000/api/health

Running Drift Detection Workflow

  1. Prepare Baseline (One-time setup)
     cd monitoring/drift/scripts
     python prepare_baseline.py
  2. Execute Drift Check
     python run_drift_check.py
  3. Verify Results
    • Check Pushgateway: http://localhost:9091
    • Check Prometheus: http://localhost:9090/graph
    • Check Grafana: http://localhost:3000