Spaces: DaCrow13/Hopcroft-Skill-Classification

giuto committed on Dec 22, 2025

Commit

94af946

1 Parent(s): 1396866

Enhance README with Grafana dashboard configuration, data drift detection details, and usage instructions

monitoring/README.md +150 -0

@@ -55,3 +55,153 @@ We used Better Stack Uptime to monitor the availability of the production deploy
 - A failure scenario was tested to confirm Better Stack reports the server error details.
 - Screenshots are available in `monitoring/screenshots/`.

+---
+## Grafana Dashboard
+Grafana provides real-time visualization of system metrics and drift detection status.
+### Configuration
+- **Port**: `3000`
+- **Credentials**: `admin` / `admin`
+- **Dashboard**: Hopcroft Monitoring Dashboard
+- **Datasource**: Prometheus (auto-provisioned)
+- **Provisioning Files**:
+  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
+  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
+  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`
+### Dashboard Panels
+1. **API Request Rate**: Rate of incoming requests per endpoint
+2. **API Latency**: Average response time per endpoint
+3. **Drift Detection Status**: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
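+4. **Drift P-Value**: P-value of the Kolmogorov-Smirnov test (lower values indicate stronger evidence of drift)
+5. **Drift Distance**: Kolmogorov-Smirnov statistic (maximum distance between the baseline and production empirical distributions)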
+### Access
+Navigate to `http://localhost:3000` and log in with the credentials listed above. The dashboard refreshes every 10 seconds.
+---
+## Data Drift Detection
+Statistical tests automatically detect distribution shift in the model's input data, so that changes in data quality are caught early.
+### Algorithm
+- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
+- **Baseline Data**: 1000 samples from training set
+- **Detection Threshold**: p-value < 0.05 (with Bonferroni correction)
+- **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
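The decision rule above can be sketched in plain Python. This is a minimal illustration, not the project's `run_drift_check.py`: the function names `ks_distance` and `detect_drift` are hypothetical, the p-values are assumed to come from a two-sample KS test such as `scipy.stats.ks_2samp`, and flagging drift when any single feature falls below the Bonferroni-corrected threshold is an assumption about how per-feature results are aggregated.

```python
from bisect import bisect_right


def ks_distance(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def detect_drift(p_values, alpha=0.05):
    """Apply a Bonferroni-corrected threshold to per-feature p-values.

    p_values maps feature name -> p-value (assumed to come from a KS test).
    """
    if not p_values:
        raise ValueError("at least one feature is required")
    # Bonferroni: divide the significance level by the number of tests.
    threshold = alpha / len(p_values)
    drifted = sorted(name for name, p in p_values.items() if p < threshold)
    return {
        "drift_detected": int(bool(drifted)),
        "threshold": threshold,
        "drifted_features": drifted,
    }
```

With two features the corrected threshold is 0.025, so a feature with p = 0.03 would not be flagged even though it is below 0.05.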
+### Scripts
+#### Baseline Preparation
+**Script**: `drift/scripts/prepare_baseline.py`
+Functionality:
+- Loads data from SQLite database (`data/raw/skillscope_data.db`)
+- Extracts numeric features only
+- Samples 1000 representative records
+- Saves to `drift/baseline/reference_data.pkl`
+Usage:
+```bash
+cd monitoring/drift/scripts
+python prepare_baseline.py
+```
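The baseline steps listed above might look roughly like the following sketch, which uses only the standard library. It is not the actual `prepare_baseline.py`: the table name passed to `build_baseline`, the helper names, and the choice to store the baseline as a dict of column name to list of values are all assumptions made for illustration.

```python
import pickle
import random
import sqlite3


def build_baseline(conn, table, n_samples=1000, seed=42):
    """Sample rows from `table` and keep only the all-numeric columns."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if not rows:
        raise ValueError(f"table {table!r} is empty")

    # Draw a reproducible random sample without replacement.
    sampled = random.Random(seed).sample(rows, min(n_samples, len(rows)))

    # A column counts as numeric when every sampled value is an int or float.
    baseline = {}
    for index, name in enumerate(columns):
        values = [row[index] for row in sampled]
        if all(isinstance(value, (int, float)) for value in values):
            baseline[name] = values
    return baseline


def save_baseline(baseline, path):
    """Pickle the baseline so a later drift check can reload it."""
    with open(path, "wb") as handle:
        pickle.dump(baseline, handle)
```

A caller would open the database with `sqlite3.connect("data/raw/skillscope_data.db")`, pass the connection and the real table name to `build_baseline`, and write the result to `drift/baseline/reference_data.pkl` with `save_baseline`.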
+#### Drift Detection
+**Script**: `drift/scripts/run_drift_check.py`
+Functionality:
+- Loads baseline reference data
+- Compares with new production data
+- Performs KS test on each feature
+- Pushes metrics to Pushgateway
+- Saves results to `drift/reports/`
+Usage:
+```bash
+cd monitoring/drift/scripts
+python run_drift_check.py
+```
+### Verification
+Check Pushgateway metrics:
+```bash
+curl http://localhost:9091/metrics | grep drift
+```
+Query in Prometheus:
+```promql
+drift_detected
+drift_p_value
+drift_distance
+```
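The same queries can also be issued programmatically through Prometheus's HTTP API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at `http://localhost:9090`, as elsewhere in this README, and that the expression returns an instant vector; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expression, base_url="http://localhost:9090"):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expression})
    return f"{base_url}/api/v1/query?{query}"


def parse_instant_vector(payload):
    """Return [(labels, value), ...] from an instant-query JSON response."""
    document = json.loads(payload)
    if document.get("status") != "success":
        raise RuntimeError(f"query failed: {document.get('error')}")
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in document["data"]["result"]
    ]


def query_metric(expression, base_url="http://localhost:9090"):
    """Run the query against a live Prometheus server."""
    url = build_query_url(expression, base_url)
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_instant_vector(response.read())
```

For example, `query_metric("drift_detected")` would return one `(labels, value)` pair per matching series once the drift check has pushed its metrics and Prometheus has scraped them.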
+---
+## Pushgateway
+Pushgateway collects metrics from short-lived jobs such as the drift detection script.
+### Configuration
+- **Port**: `9091`
+- **Persistence**: Enabled with 5-minute intervals
+- **Data Volume**: `pushgateway-data`
+### Metrics Endpoint
+Metrics are available at `http://localhost:9091/metrics`.
+### Integration
+The drift detection script pushes its metrics to Pushgateway. Prometheus then scrapes them from Pushgateway, and Grafana displays them.
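As an illustration of the push step, the sketch below renders gauges in the Prometheus text exposition format and sends them to Pushgateway's `/metrics/job/<job>` endpoint using only the standard library. The job name `drift_check` and the helper names are assumptions; the actual script may well use the `prometheus_client` library instead.

```python
import time
import urllib.request


def render_metrics(metrics):
    """Render a name -> value mapping as gauges in the text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def push_metrics(metrics, gateway="http://localhost:9091", job="drift_check"):
    """PUT the metrics to Pushgateway, replacing the group for this job."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=render_metrics(metrics).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Example values only; a real run would use the KS test results.
    push_metrics({
        "drift_detected": 0,
        "drift_p_value": 0.42,
        "drift_distance": 0.03,
        "drift_check_timestamp": time.time(),
    })
```

Using `PUT` replaces every metric previously pushed under the same job, which suits a check that always publishes the full set of drift metrics.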
+---
+## Alerting
+Alert rules are defined in `prometheus/alert_rules.yml`:
+- **High Latency**: Triggered when average latency exceeds 2 seconds
+- **High Error Rate**: Triggered when error rate exceeds 5%
+- **Data Drift Detected**: Triggered when `drift_detected == 1`
+Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.
+---
+## Complete Stack Usage
+### Starting All Services
+```bash
+# Start all monitoring services
+docker compose up -d
+# Verify all containers are running
+docker compose ps
+# Check Prometheus targets
+curl http://localhost:9090/api/v1/targets
+# Check Grafana health
+curl http://localhost:3000/api/health
+```
+### Running Drift Detection Workflow
+1. **Prepare Baseline (One-time setup)**
+   ```bash
+   cd monitoring/drift/scripts
+   python prepare_baseline.py
+   ```
+2. **Execute Drift Check**
+   ```bash
+   python run_drift_check.py
+   ```
+3. **Verify Results**
+   - Check Pushgateway: `http://localhost:9091`
+   - Check Prometheus: `http://localhost:9090/graph`
+   - Check Grafana: `http://localhost:3000`