# Metrics Collection & Verification

This directory contains the configuration for Prometheus monitoring.

## Configuration

- **Prometheus Config**: `prometheus/prometheus.yml`
- **Scrape Target**: `hopcroft-api:8080`
- **Metrics Endpoint**: `http://localhost:8080/metrics`

## Verification Queries (PromQL)

You can run these queries in the Prometheus Expression Browser (`http://localhost:9090/graph`):

### 1. Request Rate (Counter)

Shows the rate of requests per second over the last minute.

```promql
rate(hopcroft_requests_total[1m])
```

### 2. Average Request Duration (Histogram)

Calculates the average latency over the last five minutes.

```promql
rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
```

### 3. Current In-Progress Requests (Gauge)

Shows how many requests are currently being processed.

```promql
hopcroft_in_progress_requests
```

### 4. Model Prediction Time (Summary)

Shows the 90th percentile of model prediction time.

```promql
hopcroft_prediction_processing_seconds{quantile="0.9"}
```

---

## Uptime Monitoring (Better Stack)

We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.

**Base URL**

- https://dacrow13-hopcroft-skill-classification.hf.space

**Monitored endpoints**

- https://dacrow13-hopcroft-skill-classification.hf.space/health
- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
- https://dacrow13-hopcroft-skill-classification.hf.space/docs

## Prometheus on Hugging Face Space

Prometheus also runs directly on the Hugging Face Space and is accessible at:

- https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/

**Checks and alerts**

- Monitors are configured to run from multiple locations.
- Email notifications are enabled for failures.
- A failure scenario was tested to confirm Better Stack reports the server error details.
- Screenshots are available in `monitoring/screenshots/`.
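The `rate()` function used in queries 1 and 2 converts monotonically increasing counter samples into a per-second rate, and the average-latency query simply divides two such rates. A minimal sketch of that arithmetic, with invented sample values:

```python
def per_second_rate(samples):
    """Approximate PromQL rate(): per-second increase of a counter
    over a window of (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical scrapes of hopcroft_requests_total, 15 s apart.
requests_total = [(0, 100), (15, 130), (30, 160), (45, 190)]
print(per_second_rate(requests_total))  # 2.0 requests/s

# Average latency = rate(sum) / rate(count), as in query 2.
duration_sum = [(0, 10.0), (45, 19.0)]   # total seconds spent serving
duration_count = [(0, 100), (45, 190)]   # total requests observed
avg = per_second_rate(duration_sum) / per_second_rate(duration_count)
print(round(avg, 3))  # 0.1 seconds per request
```

Because counters only ever increase (modulo restarts), dividing the increase by the elapsed time gives a rate that is robust to scrape jitter.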
---

## Grafana Dashboard

Grafana provides real-time visualization of system metrics and drift detection status.

### Configuration

- **Port**: `3000`
- **Credentials**: `admin` / `admin`
- **Dashboard**: Hopcroft Monitoring Dashboard
- **Datasource**: Prometheus (auto-provisioned)
- **Provisioning Files**:
  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`

### Dashboard Panels

1. **API Request Rate**: Rate of incoming requests per endpoint
2. **API Latency**: Average response time per endpoint
3. **Drift Detection Status**: Real-time drift detection indicator (0 = No Drift, 1 = Drift Detected)
4. **Drift P-Value**: Statistical significance of detected drift
5. **Drift Distance**: Kolmogorov-Smirnov distance metric

### Access

Navigate to `http://localhost:3000` and log in with the credentials above. The dashboard refreshes every 10 seconds.

---

## Data Drift Detection

Automated distribution shift detection using statistical testing to monitor model input data quality.
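For orientation, an auto-provisioned Grafana datasource file such as `grafana/provisioning/datasources/prometheus.yml` typically looks like the sketch below. The `url` assumes Prometheus is reachable under the Compose service name `prometheus`; that name is an assumption, not a value taken from this repository.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus            # name shown in Grafana's datasource list
    type: prometheus
    access: proxy               # Grafana's backend proxies the queries
    url: http://prometheus:9090 # assumed Compose service name
    isDefault: true
```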
### Algorithm

- **Method**: Kolmogorov-Smirnov two-sample test (scipy-based)
- **Baseline Data**: 1000 samples from the training set
- **Detection Threshold**: p-value < 0.05 (with Bonferroni correction)
- **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`

### Scripts

#### Baseline Preparation

**Script**: `drift/scripts/prepare_baseline.py`

Functionality:

- Loads data from the SQLite database (`data/raw/skillscope_data.db`)
- Extracts numeric features only
- Samples 1000 representative records
- Saves to `drift/baseline/reference_data.pkl`

Usage:

```bash
cd monitoring/drift/scripts
python prepare_baseline.py
```

#### Drift Detection

**Script**: `drift/scripts/run_drift_check.py`

Functionality:

- Loads baseline reference data
- Compares it with new production data
- Performs a KS test on each feature
- Pushes metrics to Pushgateway
- Saves results to `drift/reports/`

Usage:

```bash
cd monitoring/drift/scripts
python run_drift_check.py
```

### Verification

Check Pushgateway metrics:

```bash
curl http://localhost:9091/metrics | grep drift
```

Query in Prometheus:

```promql
drift_detected
drift_p_value
drift_distance
```

---

## Pushgateway

Pushgateway collects metrics from short-lived jobs such as the drift detection script.

### Configuration

- **Port**: `9091`
- **Persistence**: Enabled with 5-minute intervals
- **Data Volume**: `pushgateway-data`

### Metrics Endpoint

Access metrics at `http://localhost:9091/metrics`

### Integration

The drift detection script pushes metrics to Pushgateway, which Prometheus then scrapes and Grafana displays.
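The per-feature KS check with Bonferroni correction described above can be sketched as follows. This is a minimal illustration, not the actual `run_drift_check.py`; the feature names and data are invented for the example:

```python
import random

from scipy import stats

def check_drift(baseline, current, alpha=0.05):
    """Run a two-sample KS test per feature, Bonferroni-correcting
    the threshold by dividing alpha by the number of features."""
    corrected_alpha = alpha / len(baseline)
    results = {}
    for feature, ref_values in baseline.items():
        distance, p_value = stats.ks_2samp(ref_values, current[feature])
        results[feature] = {
            "drift_detected": int(p_value < corrected_alpha),
            "drift_p_value": p_value,
            "drift_distance": distance,  # max gap between the two empirical CDFs
        }
    return results

# Invented data: one stable feature, one feature whose mean has shifted.
random.seed(0)
baseline = {
    "salary": [random.gauss(50, 5) for _ in range(1000)],
    "years_exp": [random.gauss(5, 1) for _ in range(1000)],
}
current = {
    "salary": [random.gauss(50, 5) for _ in range(500)],
    "years_exp": [random.gauss(8, 1) for _ in range(500)],  # drifted
}
for name, result in check_drift(baseline, current).items():
    print(name, result["drift_detected"])
```

Bonferroni correction keeps the overall false-alarm rate near 5% even though one test runs per feature; with many features a single uncorrected p < 0.05 would fire spuriously quite often.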
---

## Alerting

Alert rules are defined in `prometheus/alert_rules.yml`:

- **High Latency**: Triggered when average latency exceeds 2 seconds
- **High Error Rate**: Triggered when the error rate exceeds 5%
- **Data Drift Detected**: Triggered when `drift_detected = 1`

Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.

---

## Complete Stack Usage

### Starting All Services

```bash
# Start all monitoring services
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets
curl http://localhost:9090/targets

# Check Grafana health
curl http://localhost:3000/api/health
```

### Running Drift Detection Workflow

1. **Prepare Baseline (one-time setup)**

   ```bash
   cd monitoring/drift/scripts
   python prepare_baseline.py
   ```

2. **Execute Drift Check**

   ```bash
   python run_drift_check.py
   ```

3. **Verify Results**

   - Check Pushgateway: `http://localhost:9091`
   - Check Prometheus: `http://localhost:9090/graph`
   - Check Grafana: `http://localhost:3000`
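At the protocol level, "pushing metrics to Pushgateway" means sending the Prometheus text exposition format to its per-job endpoint. The sketch below shows this with only the standard library; the job name `drift_check` is an assumption, and the real script may use a client library instead:

```python
import urllib.request

def build_payload(metrics):
    """Render a {name: value} dict in the Prometheus text exposition
    format that Pushgateway accepts: one 'name value' line each,
    newline-terminated."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items()).encode()

def push_metrics(metrics, job="drift_check", gateway="http://localhost:9091"):
    """PUT the payload to Pushgateway's per-job endpoint, replacing
    any metrics previously pushed for that job."""
    req = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=build_payload(metrics),
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Inspect the payload without needing a running Pushgateway.
print(build_payload({"drift_detected": 1, "drift_p_value": 0.003}).decode())
```

`PUT` replaces the whole metric group for the job, which suits a periodic batch check; `POST` to the same path would merge with previously pushed metrics instead.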