# Metrics Collection & Verification

This directory contains the configuration for Prometheus monitoring.

## Configuration

- **Prometheus Config**: `prometheus/prometheus.yml`
- **Scrape Target**: `hopcroft-api:8080`
- **Metrics Endpoint**: `http://localhost:8080/metrics`

## Verification Queries (PromQL)

You can run these queries in the Prometheus Expression Browser (`http://localhost:9090/graph`):

### 1. Request Rate (Counter)

Shows the per-second request rate, averaged over the last minute.

```promql
rate(hopcroft_requests_total[1m])
```

### 2. Average Request Duration (Histogram)

Calculates the average request latency over the last 5 minutes by dividing the rate of the duration sum by the rate of the request count.

```promql
rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
```

### 3. Current In-Progress Requests (Gauge)

Shows how many requests are currently being processed.

```promql
hopcroft_in_progress_requests
```

### 4. Model Prediction Time (Summary)

Shows the 90th percentile of model prediction time.

```promql
hopcroft_prediction_processing_seconds{quantile="0.9"}
```
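
The same queries can also be run outside the browser through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch that uses only the Python standard library. The helper names (`build_query_url`, `parse_instant_vector`, `run_query`) are illustrative and not part of this repository, and the base URL assumes the local Prometheus instance described above.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus address


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_query(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Execute an instant query against a running Prometheus server."""
    url = build_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # Requires the monitoring stack to be running locally.
    for labels, value in run_query("rate(hopcroft_requests_total[1m])"):
        print(labels, value)
```

An empty result list usually means the metric has not been scraped yet or no requests have reached the API.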

---

## Uptime Monitoring (Better Stack)

We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.

**Base URL**

- https://dacrow13-hopcroft-skill-classification.hf.space

**Monitored endpoints**

- https://dacrow13-hopcroft-skill-classification.hf.space/health
- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
- https://dacrow13-hopcroft-skill-classification.hf.space/docs

**Checks and alerts**

- Monitors are configured to run from multiple locations.
- Email notifications are enabled for failures.
- A failure scenario was tested to confirm Better Stack reports the server error details.
- Screenshots are available in `monitoring/screenshots/`.

## Prometheus on Hugging Face Space

Prometheus is also running directly on the Hugging Face Space and is accessible at:

- https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/

---

## Grafana Dashboard

Grafana provides real-time visualization of system metrics and drift detection status.

### Configuration

- **Port**: `3000`
- **Credentials**: `admin` / `admin`
- **Dashboard**: Hopcroft Monitoring Dashboard
- **Datasource**: Prometheus (auto-provisioned)
- **Provisioning Files**:
  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`

### Dashboard Panels

1. **API Request Rate**: Rate of incoming requests per endpoint
2. **API Latency**: Average response time per endpoint
3. **Drift Detection Status**: Real-time drift detection indicator (0 = No Drift, 1 = Drift Detected)
4. **Drift P-Value**: Statistical significance of detected drift
5. **Drift Distance**: Kolmogorov-Smirnov distance metric

### Access

Navigate to `http://localhost:3000` and log in with the credentials listed above. The dashboard refreshes every 10 seconds.

---

## Data Drift Detection

Automated distribution-shift detection that uses statistical testing to monitor the quality of model input data.

### Algorithm

- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
- **Baseline Data**: 1000 samples from the training set
- **Detection Threshold**: p-value < 0.05 after Bonferroni correction (each feature's p-value is compared against 0.05 divided by the number of features tested)
- **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
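
The decision rule above can be sketched as follows. This is an illustrative outline, not the code in `run_drift_check.py`: the function names are hypothetical, and reporting the smallest p-value and the largest KS distance as the published values is an assumption. Only `scipy.stats.ks_2samp` is a real library call, and it is imported lazily so the Bonferroni logic can be read and tested on its own.

```python
def ks_per_feature(baseline, current):
    """Run a two-sample KS test per feature.

    `baseline` and `current` map feature names to sequences of numbers.
    Returns {feature: (ks_distance, p_value)}.
    """
    from scipy.stats import ks_2samp  # requires scipy

    results = {}
    for name in baseline:
        res = ks_2samp(baseline[name], current[name])
        results[name] = (float(res.statistic), float(res.pvalue))
    return results


def summarize_drift(results, alpha=0.05):
    """Apply a Bonferroni-corrected threshold to per-feature KS results."""
    if not results:
        raise ValueError("no features were tested")
    # Bonferroni: divide the family-wise alpha by the number of tests.
    threshold = alpha / len(results)
    drifted = [name for name, (_, p) in results.items() if p < threshold]
    return {
        "drift_detected": int(bool(drifted)),
        # Assumption: publish the most extreme values across features.
        "drift_p_value": min(p for _, p in results.values()),
        "drift_distance": max(d for d, _ in results.values()),
        "drifted_features": drifted,
    }
```

With two features, the per-feature threshold becomes 0.025, so a feature with p = 0.03 would not be flagged even though it falls below the uncorrected 0.05 level.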

### Scripts

#### Baseline Preparation

**Script**: `drift/scripts/prepare_baseline.py`

Functionality:

- Loads data from the SQLite database (`data/raw/skillscope_data.db`)
- Extracts numeric features only
- Samples 1000 representative records
- Saves to `drift/baseline/reference_data.pkl`

Usage:

```bash
cd monitoring/drift/scripts
python prepare_baseline.py
```
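
For orientation, the baseline steps listed above could look roughly like the sketch below. It is not the contents of `prepare_baseline.py`: the table name is a placeholder, the function names are hypothetical, and uniform random sampling with a fixed seed stands in for whatever "representative" sampling the real script performs. Only the standard library is used.

```python
import pickle
import random
import sqlite3


def sample_numeric_baseline(conn, table, n_samples=1000, seed=42):
    """Return {column: [values]} for numeric columns of a random row sample."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    # Keep a column only if every value is an int or float (NULLs exclude it).
    numeric_idx = [
        i for i in range(len(names))
        if rows and all(isinstance(r[i], (int, float)) for r in rows)
    ]
    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    sampled = rng.sample(rows, min(n_samples, len(rows)))
    return {names[i]: [r[i] for r in sampled] for i in numeric_idx}


def save_baseline(baseline, path):
    """Pickle the baseline dictionary to `path`."""
    with open(path, "wb") as fh:
        pickle.dump(baseline, fh)


if __name__ == "__main__":
    # "my_table" is a placeholder; the real table name is not documented here.
    with sqlite3.connect("data/raw/skillscope_data.db") as conn:
        baseline = sample_numeric_baseline(conn, "my_table")
    save_baseline(baseline, "drift/baseline/reference_data.pkl")
```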

#### Drift Detection

**Script**: `drift/scripts/run_drift_check.py`

Functionality:

- Loads baseline reference data
- Compares it with new production data
- Performs a KS test on each feature
- Pushes metrics to Pushgateway
- Saves results to `drift/reports/`

Usage:

```bash
cd monitoring/drift/scripts
python run_drift_check.py
```

### Verification

Check Pushgateway metrics:

```bash
curl http://localhost:9091/metrics | grep drift
```

Query in Prometheus (run each expression as a separate query):

```promql
drift_detected
drift_p_value
drift_distance
```

---

## Pushgateway

Pushgateway collects metrics from short-lived jobs such as the drift detection script.

### Configuration

- **Port**: `9091`
- **Persistence**: Enabled with 5-minute intervals
- **Data Volume**: `pushgateway-data`

### Metrics Endpoint

Access metrics at `http://localhost:9091/metrics`.

### Integration

The drift detection script pushes its metrics to Pushgateway. Prometheus then scrapes Pushgateway, and Grafana displays the results.
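
As a rough illustration of the push step, the sketch below renders gauge values in the Prometheus text exposition format and sends them to Pushgateway's `/metrics/job/<job>` endpoint using only the standard library. The job name `drift_check` and the function names are assumptions made for this example; the actual script may well use the `prometheus_client` library instead.

```python
import urllib.request


def render_gauges(metrics):
    """Render {name: value} as text-format gauge samples."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def push_metrics(metrics, gateway="http://localhost:9091", job="drift_check", timeout=5):
    """PUT the metrics to Pushgateway, replacing the job's previous group."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=render_gauges(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Requires a running Pushgateway on port 9091; values are examples only.
    push_metrics({"drift_detected": 0, "drift_p_value": 0.42, "drift_distance": 0.07})
```

`PUT` replaces every metric in the job's group, while `POST` would replace only metrics with the same names, so `PUT` avoids stale series lingering between runs.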

---

## Alerting

Alert rules are defined in `prometheus/alert_rules.yml`:

- **High Latency**: Triggered when average latency exceeds 2 seconds
- **High Error Rate**: Triggered when error rate exceeds 5%
- **Data Drift Detected**: Triggered when `drift_detected` = 1

Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.

---

## Complete Stack Usage

### Starting All Services

```bash
# Start all monitoring services
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets (JSON from the HTTP API; /targets only serves the web UI)
curl http://localhost:9090/api/v1/targets

# Check Grafana health
curl http://localhost:3000/api/health
```
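
To check target health programmatically rather than by reading raw output, the JSON returned by the Prometheus `/api/v1/targets` endpoint can be filtered for targets that are not `up`. The sketch below is a minimal, standard-library-only example; the function names are hypothetical and the URL assumes the local stack.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, health) for every active target that is not up."""
    active = payload["data"]["activeTargets"]
    return [
        (t["labels"].get("job", "unknown"), t["scrapeUrl"], t["health"])
        for t in active
        if t["health"] != "up"
    ]


def fetch_targets(base_url="http://localhost:9090", timeout=5):
    """Fetch the target list from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the monitoring stack to be running locally.
    problems = unhealthy_targets(fetch_targets())
    for job, url, health in problems:
        print(f"{job}: {url} is {health}")
    if not problems:
        print("all targets up")
```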

### Running Drift Detection Workflow

1. **Prepare Baseline (One-time setup)**

   ```bash
   cd monitoring/drift/scripts
   python prepare_baseline.py
   ```

2. **Execute Drift Check**

   ```bash
   python run_drift_check.py
   ```

3. **Verify Results**

   - Check Pushgateway: `http://localhost:9091`
   - Check Prometheus: `http://localhost:9090/graph`
   - Check Grafana: `http://localhost:3000`