# Metrics Collection & Verification

This directory contains the configuration for Prometheus monitoring.

## Configuration
- **Prometheus Config**: `prometheus/prometheus.yml`
- **Scrape Target**: `hopcroft-api:8080`
- **Metrics Endpoint**: `http://localhost:8080/metrics`

## Verification Queries (PromQL)

You can run these queries in the Prometheus Expression Browser (`http://localhost:9090/graph`):

### 1. Request Rate (Counter)
Shows the per-second request rate, averaged over the last minute.
```promql
rate(hopcroft_requests_total[1m])
```
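
The same query can also be run outside the Expression Browser through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` as above; the helper names `build_query_url` and `run_query` are illustrative and not part of this repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # same host/port as the Expression Browser


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression (URL-encoded)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def run_query(expr, base_url=PROMETHEUS_URL):
    """Execute an instant query and return the list of result series."""
    with urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    for series in run_query("rate(hopcroft_requests_total[1m])"):
        # Each sample looks like {"metric": {...labels}, "value": [timestamp, "value"]}
        print(series["metric"], series["value"][1])
```

An empty result list usually means the target has not been scraped yet or the metric name does not match.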

### 2. Average Request Duration (Histogram)
Calculates the average request latency in seconds over the last 5 minutes.
```promql
rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
```
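
Both rates in this query are taken over the same window, so the window length cancels and the result is simply "seconds spent" divided by "requests served". The sketch below reproduces that arithmetic from two scrapes of the `_sum` and `_count` series. The function name and the sample numbers are made up for illustration, and unlike `rate()` it does not compensate for counter resets.

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Mean seconds per request between two scrapes of a histogram's _sum and _count."""
    requests = count_end - count_start
    if requests <= 0:
        # No traffic in the window (or a counter reset, which this sketch ignores)
        return None
    return (sum_end - sum_start) / requests


# Hypothetical scrapes: 6.0 seconds of total latency spread over 30 new requests
print(average_duration(12.0, 18.0, 100, 130))
```

With these numbers the average is 6.0 / 30 = 0.2 seconds per request.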

### 3. Current In-Progress Requests (Gauge)
Shows how many requests are currently being processed.
```promql
hopcroft_in_progress_requests
```

### 4. Model Prediction Time (Summary)
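Shows the 90th percentile of model prediction time. The query returns data only if the client exports a `0.9` quantile for this summary; the official Python client's `Summary` exposes only `_sum` and `_count`.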
```promql
hopcroft_prediction_processing_seconds{quantile="0.9"}
```

---

## Uptime Monitoring (Better Stack)

We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.

**Base URL**
- https://dacrow13-hopcroft-skill-classification.hf.space

**Monitored endpoints**
- https://dacrow13-hopcroft-skill-classification.hf.space/health
- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
- https://dacrow13-hopcroft-skill-classification.hf.space/docs

**Checks and alerts**
- Monitors are configured to run from multiple locations.
- Email notifications are enabled for failures.
- A failure scenario was tested to confirm Better Stack reports the server error details.
- Screenshots are available in `monitoring/screenshots/`.

## Prometheus on Hugging Face Space

Prometheus is also running directly on the Hugging Face Space and is accessible at:
- https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/

---

## Grafana Dashboard

Grafana provides real-time visualization of system metrics and drift detection status.

### Configuration
- **Port**: `3000`
- **Credentials**: `admin` / `admin`
- **Dashboard**: Hopcroft Monitoring Dashboard
- **Datasource**: Prometheus (auto-provisioned)
- **Provisioning Files**:
  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`

### Dashboard Panels
1. **API Request Rate**: Rate of incoming requests per endpoint
2. **API Latency**: Average response time per endpoint
3. **Drift Detection Status**: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
4. **Drift P-Value**: Statistical significance of detected drift
5. **Drift Distance**: Kolmogorov-Smirnov distance metric

### Access
Navigate to `http://localhost:3000` and log in with the provided credentials. The dashboard refreshes every 10 seconds.

---

## Data Drift Detection

Automated distribution shift detection using statistical testing to monitor model input data quality.

### Algorithm
- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
- **Baseline Data**: 1000 samples from training set
- **Detection Threshold**: p-value < 0.05, Bonferroni-corrected (the per-feature threshold is 0.05 divided by the number of features tested)
- **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
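
To make the corrected threshold concrete, here is a minimal sketch of a Bonferroni decision over per-feature p-values. It is an illustration under stated assumptions, not the code in `run_drift_check.py`: the function name, the feature names, and the rule "drift is flagged if any feature falls below the corrected threshold" are all assumptions.

```python
def detect_drift(p_values, alpha=0.05):
    """Apply a Bonferroni-corrected threshold to a {feature: p_value} mapping."""
    if not p_values:
        raise ValueError("at least one feature is required")
    # Bonferroni: divide the family-wise alpha by the number of tests performed
    threshold = alpha / len(p_values)
    drifted = sorted(name for name, p in p_values.items() if p < threshold)
    return {
        "threshold": threshold,
        "drifted_features": drifted,
        "drift_detected": int(bool(drifted)),  # 0 = no drift, 1 = drift detected
    }


# Hypothetical p-values for four features; the corrected threshold is 0.05 / 4 = 0.0125
result = detect_drift({"f1": 0.03, "f2": 0.2, "f3": 0.004, "f4": 0.5})
print(result)
```

Note that `f1` (p = 0.03) would count as drift against an uncorrected 0.05 threshold but does not after the correction, which is the point of the adjustment when many features are tested at once.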

### Scripts

#### Baseline Preparation
**Script**: `drift/scripts/prepare_baseline.py`

Functionality:
- Loads data from SQLite database (`data/raw/skillscope_data.db`)
- Extracts numeric features only
- Samples 1000 representative records
- Saves to `drift/baseline/reference_data.pkl`

Usage:
```bash
cd monitoring/drift/scripts
python prepare_baseline.py
```
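
The steps listed above can be sketched with the standard library alone. This is an assumption-laden illustration rather than the contents of `prepare_baseline.py`: the table name passed to `load_numeric_rows` is hypothetical (the README does not name it), the fixed seed is an arbitrary choice, and the real script may well use pandas instead.

```python
import pickle
import random
import sqlite3


def load_numeric_rows(conn, table):
    """Return rows as dicts, keeping only columns whose values are all int or float."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # A column containing text or NULL in any row is treated as non-numeric
    numeric = [
        i for i in range(len(columns))
        if rows and all(isinstance(r[i], (int, float)) for r in rows)
    ]
    return [{columns[i]: r[i] for i in numeric} for r in rows]


def sample_baseline(rows, n=1000, seed=42):
    """Draw n rows without replacement; return everything if fewer are available."""
    if len(rows) <= n:
        return rows
    return random.Random(seed).sample(rows, n)


def save_baseline(rows, path):
    """Pickle the sampled rows so the drift check can reload them later."""
    with open(path, "wb") as fh:
        pickle.dump(rows, fh)


if __name__ == "__main__":
    conn = sqlite3.connect("data/raw/skillscope_data.db")
    rows = load_numeric_rows(conn, "features")  # "features" is a hypothetical table name
    save_baseline(sample_baseline(rows), "drift/baseline/reference_data.pkl")
```

Seeding the sampler keeps the baseline reproducible, so repeated runs compare production data against the same reference set.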

#### Drift Detection
**Script**: `drift/scripts/run_drift_check.py`

Functionality:
- Loads baseline reference data
- Compares with new production data
- Performs KS test on each feature
- Pushes metrics to Pushgateway
- Saves results to `drift/reports/`

Usage:
```bash
cd monitoring/drift/scripts
python run_drift_check.py
```
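
For intuition about what the per-feature KS test measures, the sketch below computes the two-sample KS distance (the value published as `drift_distance`) in plain Python. The actual script is scipy-based, presumably via `scipy.stats.ks_2samp`, which also returns the p-value; this stdlib version computes only the distance and is meant as an explanation, not a replacement.

```python
from bisect import bisect_right


def ks_distance(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    distance = 0.0
    # The ECDFs only change at observed values, so checking those points is enough
    for x in ref + cur:
        # bisect_right counts the values <= x, i.e. n * ECDF(x)
        gap = abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        distance = max(distance, gap)
    return distance


print(ks_distance([1, 2, 3], [1, 2, 3]))        # identical samples
print(ks_distance([1, 2, 3], [10, 11, 12]))     # completely separated samples
print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))  # partially overlapping samples
```

The distance ranges from 0 (identical empirical distributions) to 1 (no overlap at all); larger values on a feature indicate a stronger shift from the baseline.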

### Verification
Check Pushgateway metrics:
```bash
curl http://localhost:9091/metrics | grep drift
```

Query in Prometheus (run each expression separately):
```promql
drift_detected
drift_p_value
drift_distance
```

---

## Pushgateway

Pushgateway collects metrics from short-lived jobs such as the drift detection script.

### Configuration
- **Port**: `9091`
- **Persistence**: Enabled with 5-minute intervals
- **Data Volume**: `pushgateway-data`

### Metrics Endpoint
Access metrics at `http://localhost:9091/metrics`

### Integration
The drift detection script pushes metrics to Pushgateway, which are then scraped by Prometheus and displayed in Grafana.
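
The push step can be sketched as a plain HTTP request: Pushgateway accepts metrics in the Prometheus text format at `/metrics/job/<job_name>`. The example below is a hedged illustration that uses only the standard library. The job name `drift_check` is hypothetical, and the real script may use the `prometheus_client` library instead.

```python
from urllib.request import Request, urlopen

PUSHGATEWAY_URL = "http://localhost:9091"  # port from the configuration above


def render_metrics(values):
    """Render a {metric_name: value} mapping as Prometheus text exposition format."""
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The payload must end with a newline
    return "\n".join(lines) + "\n"


def push_metrics(values, job="drift_check", base_url=PUSHGATEWAY_URL):
    """PUT the metrics; this replaces everything previously pushed under the same job."""
    request = Request(
        f"{base_url}/metrics/job/{job}",
        data=render_metrics(values).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urlopen(request, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical values from one drift check
    push_metrics({"drift_detected": 1, "drift_p_value": 0.003, "drift_distance": 0.41})
```

Because Pushgateway keeps the last pushed value until it is overwritten or deleted, Prometheus keeps scraping the most recent drift result between runs.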

---

## Alerting

Alert rules are defined in `prometheus/alert_rules.yml`:

- **High Latency**: Triggered when average latency exceeds 2 seconds
- **High Error Rate**: Triggered when error rate exceeds 5%
- **Data Drift Detected**: Triggered when `drift_detected == 1`

Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.

---

## Complete Stack Usage

### Starting All Services
```bash
# Start all monitoring services
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets (JSON from the HTTP API; /targets is the browser UI page)
curl http://localhost:9090/api/v1/targets

# Check Grafana health
curl http://localhost:3000/api/health
```
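
The manual checks above can be bundled into one script. The sketch below is an assumption-based illustration: the ports are the ones listed in this README, `/api/health` is the Grafana endpoint used above, and `/-/healthy` is the health endpoint exposed by Prometheus, Pushgateway, and Alertmanager. The function names are illustrative.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Ports as documented in this README; adjust if your compose file maps them differently
HEALTH_ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "pushgateway": "http://localhost:9091/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def check(url, opener=urlopen):
    """Return True if the endpoint answers with HTTP 200, False on any failure."""
    try:
        with opener(url, timeout=3) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def check_all(endpoints=HEALTH_ENDPOINTS, opener=urlopen):
    """Check every endpoint and return a {service: healthy} mapping."""
    return {name: check(url, opener) for name, url in endpoints.items()}


if __name__ == "__main__":
    for service, healthy in check_all().items():
        print(f"{service}: {'ok' if healthy else 'DOWN'}")
```

The `opener` parameter exists only so the check can be exercised without a running stack, for example in a unit test.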

### Running Drift Detection Workflow

1. **Prepare Baseline (One-time setup)**
   ```bash
   cd monitoring/drift/scripts
   python prepare_baseline.py
   ```

2. **Execute Drift Check** (from the same directory)
   ```bash
   python run_drift_check.py
   ```

3. **Verify Results**
   - Check Pushgateway: `http://localhost:9091`
   - Check Prometheus: `http://localhost:9090/graph`
   - Check Grafana: `http://localhost:3000`