# Metrics Collection & Verification
This directory contains the configuration for Prometheus monitoring.
## Configuration
- **Prometheus Config**: `prometheus/prometheus.yml`
- **Scrape Target**: `hopcroft-api:8080`
- **Metrics Endpoint**: `http://localhost:8080/metrics`
## Verification Queries (PromQL)
You can run these queries in the Prometheus Expression Browser (`http://localhost:9090/graph`):
### 1. Request Rate (Counter)
Shows the per-second request rate, averaged over the last minute.
```promql
rate(hopcroft_requests_total[1m])
```
### 2. Average Request Duration (Histogram)
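Calculates the average request latency over the last 5 minutes by dividing the rate of total observed time by the rate of observations.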
```promql
rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
```
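The arithmetic behind this query can be sketched in a few lines of Python. This is a simplification that compares two scrapes taken one window apart; the real `rate()` function also handles counter resets and extrapolation, and the sample values below are made up for illustration.

```python
def average_latency(sum_start, sum_end, count_start, count_end, window_seconds):
    """Approximate rate(_sum) / rate(_count) from two scrapes of a histogram."""
    rate_sum = (sum_end - sum_start) / window_seconds        # seconds of latency per second
    rate_count = (count_end - count_start) / window_seconds  # requests per second
    if rate_count == 0:
        return float("nan")  # no requests in the window; PromQL's 0/0 is also NaN
    return rate_sum / rate_count


# Hypothetical scrapes 5 minutes apart: 20 new requests took 6 seconds in total.
avg = average_latency(sum_start=12.0, sum_end=18.0,
                      count_start=40, count_end=60,
                      window_seconds=300)
print(f"average latency: {avg:.3f} s")
```

The window length cancels out of the ratio, which is why the result depends only on the increase in `_sum` and `_count`.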
### 3. Current In-Progress Requests (Gauge)
Shows how many requests are currently being processed.
```promql
hopcroft_in_progress_requests
```
### 4. Model Prediction Time (Summary)
Shows the 90th percentile of model prediction time.
```promql
hopcroft_prediction_processing_seconds{quantile="0.9"}
```
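For orientation, the four metric types queried above could be declared roughly as follows with the official Python client (`prometheus_client`). This is a sketch, not the project's actual instrumentation: the help strings and observed values are assumptions. One caveat worth knowing is that this client's `Summary` exposes only `_count` and `_sum`, so the `quantile="0.9"` query assumes an instrumentation library that exports quantiles.

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, Summary, generate_latest,
)

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

REQUESTS = Counter("hopcroft_requests_total", "Total API requests",
                   registry=registry)
DURATION = Histogram("hopcroft_request_duration_seconds",
                     "Request duration in seconds", registry=registry)
IN_PROGRESS = Gauge("hopcroft_in_progress_requests",
                    "Requests currently being processed", registry=registry)
PREDICTION = Summary("hopcroft_prediction_processing_seconds",
                     "Model prediction time in seconds", registry=registry)


def handle_request():
    """Simulate one request and record it in all four metrics."""
    REQUESTS.inc()
    with IN_PROGRESS.track_inprogress():  # +1 on entry, -1 on exit
        DURATION.observe(0.25)            # hypothetical total request time
        PREDICTION.observe(0.12)          # hypothetical model time


handle_request()
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```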
---
## Uptime Monitoring (Better Stack)
We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.
**Base URL**
- https://dacrow13-hopcroft-skill-classification.hf.space
**Monitored endpoints**
- https://dacrow13-hopcroft-skill-classification.hf.space/health
- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
- https://dacrow13-hopcroft-skill-classification.hf.space/docs
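The same endpoints can be probed manually with a short standard-library script. This is only a rough local stand-in for what an uptime monitor does; treating HTTP 200 as "up" and the 10-second timeout are assumptions, not Better Stack settings.

```python
import urllib.error
import urllib.request

BASE_URL = "https://dacrow13-hopcroft-skill-classification.hf.space"
ENDPOINTS = ["/health", "/openapi.json", "/docs"]


def http_status(url, timeout=10.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a 4xx/5xx status
    except OSError:
        return None      # DNS failure, refused connection, or timeout


def check_endpoints(base_url=BASE_URL, endpoints=ENDPOINTS, fetch=http_status):
    """Map each endpoint URL to True (HTTP 200) or False (anything else)."""
    return {base_url + path: fetch(base_url + path) == 200 for path in endpoints}
```

Calling `check_endpoints()` returns a dictionary keyed by URL; the `fetch` parameter exists so the logic can be exercised without network access.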
**Checks and alerts**
- Monitors are configured to run from multiple locations.
- Email notifications are enabled for failures.
- A failure scenario was tested to confirm that Better Stack reports the server error details.
- Screenshots are available in `monitoring/screenshots/`.
## Prometheus on Hugging Face Space
Prometheus also runs directly on the Hugging Face Space and is accessible at:
- https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/
---
## Grafana Dashboard
Grafana provides real-time visualization of system metrics and drift detection status.
### Configuration
- **Port**: `3000`
- **Credentials**: `admin` / `admin`
- **Dashboard**: Hopcroft Monitoring Dashboard
- **Datasource**: Prometheus (auto-provisioned)
- **Provisioning Files**:
  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`
### Dashboard Panels
1. **API Request Rate**: Rate of incoming requests per endpoint
2. **API Latency**: Average response time per endpoint
3. **Drift Detection Status**: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
4. **Drift P-Value**: Statistical significance of detected drift
5. **Drift Distance**: Kolmogorov-Smirnov statistic (maximum distance between the two empirical distributions)
### Access
Navigate to `http://localhost:3000` and log in with the credentials listed above. The dashboard refreshes every 10 seconds.
---
## Data Drift Detection
Automated distribution shift detection using statistical testing to monitor model input data quality.
### Algorithm
- **Method**: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
- **Baseline Data**: 1000 samples from training set
- **Detection Threshold**: p-value < 0.05 after Bonferroni correction (each feature is tested against 0.05 divided by the number of features)
- **Metrics Published**: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
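A minimal sketch of this procedure, using `scipy.stats.ks_2samp` on synthetic data, is shown below. How the per-feature results are aggregated into the published metrics (smallest p-value, largest distance) is an assumption made for illustration, as are the function name and the generated arrays.

```python
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.05  # family-wise significance level


def detect_drift(baseline, current, alpha=ALPHA):
    """Run a two-sample KS test per feature column with Bonferroni correction.

    baseline, current: 2-D arrays of shape (n_samples, n_features).
    """
    n_features = baseline.shape[1]
    threshold = alpha / n_features  # Bonferroni: split alpha across the tests
    features = []
    for j in range(n_features):
        test = ks_2samp(baseline[:, j], current[:, j])
        features.append({
            "feature": j,
            "distance": float(test.statistic),
            "p_value": float(test.pvalue),
            "drifted": bool(test.pvalue < threshold),
        })
    return {
        # Assumed aggregation: any drifted feature flags the whole dataset.
        "drift_detected": int(any(f["drifted"] for f in features)),
        "drift_p_value": min(f["p_value"] for f in features),
        "drift_distance": max(f["distance"] for f in features),
        "features": features,
    }


# Synthetic example: 1000 baseline samples, new data with one shifted feature.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(1000, 3))
current = rng.normal(size=(1000, 3))
current[:, 1] += 1.0  # simulate drift in the second feature
report = detect_drift(baseline, current)
print(report["drift_detected"], report["drift_distance"])
```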
### Scripts
#### Baseline Preparation
**Script**: `drift/scripts/prepare_baseline.py`
Functionality:
- Loads data from SQLite database (`data/raw/skillscope_data.db`)
- Extracts numeric features only
- Samples 1000 representative records
- Saves to `drift/baseline/reference_data.pkl`
Usage:
```bash
cd monitoring/drift/scripts
python prepare_baseline.py
```
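The steps listed above can be sketched with the standard library alone. This is not the contents of `prepare_baseline.py`: the table name, the fixed random seed, the dictionary-of-columns output layout, and the rule that a column counts as numeric only if every value is an `int` or `float` are all assumptions.

```python
import pickle
import random
import sqlite3


def prepare_baseline(db_path, table, out_path, n_samples=1000, seed=42):
    """Sample rows from a SQLite table, keep numeric columns, and pickle them."""
    conn = sqlite3.connect(db_path)
    try:
        # Identifiers cannot be bound as parameters, so the name is quoted.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()

    # A column is treated as numeric only if every value is an int or float
    # (columns containing NULL or text are skipped).
    numeric = [
        i for i in range(len(columns))
        if rows and all(isinstance(row[i], (int, float)) for row in rows)
    ]

    # Seeded sampling keeps the baseline reproducible between runs.
    sample = random.Random(seed).sample(rows, min(n_samples, len(rows)))
    baseline = {columns[i]: [row[i] for row in sample] for i in numeric}

    with open(out_path, "wb") as handle:
        pickle.dump(baseline, handle)
    return baseline
```

A call might look like `prepare_baseline("data/raw/skillscope_data.db", "some_table", "drift/baseline/reference_data.pkl")`, where `some_table` is a placeholder for whichever table holds the features.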
#### Drift Detection
**Script**: `drift/scripts/run_drift_check.py`
Functionality:
- Loads baseline reference data
- Compares with new production data
- Performs KS test on each feature
- Pushes metrics to Pushgateway
- Saves results to `drift/reports/`
Usage:
```bash
cd monitoring/drift/scripts
python run_drift_check.py
```
### Verification
Check Pushgateway metrics:
```bash
curl http://localhost:9091/metrics | grep drift
```
Query in Prometheus (run each expression separately):
```promql
drift_detected
drift_p_value
drift_distance
```
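The same check can be scripted against the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at `http://localhost:9090`; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"


def parse_instant_vector(payload):
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string"]}.
    return [(sample["metric"], float(sample["value"][1]))
            for sample in data["result"]]


def query(expression, base_url=PROMETHEUS_URL):
    """Run an instant query and return the parsed samples."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expression})
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

With the stack running, `query("drift_detected")` should return one sample per pushed series; an empty list means Prometheus has not scraped the metric yet.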
---
## Pushgateway
Pushgateway collects metrics from short-lived jobs such as the drift detection script.
### Configuration
- **Port**: `9091`
- **Persistence**: Enabled with 5-minute intervals
- **Data Volume**: `pushgateway-data`
### Metrics Endpoint
Access metrics at `http://localhost:9091/metrics`
### Integration
The drift detection script pushes its metrics to Pushgateway. Prometheus scrapes them from there, and Grafana displays them.
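The push step can be illustrated with a plain HTTP request, since Pushgateway accepts the Prometheus text format on `/metrics/job/<job>`. This sketch uses only the standard library; the job name `drift_detection` and the choice to export every value as a gauge are assumptions, and the actual script may use a client library instead.

```python
import urllib.request


def build_payload(metrics):
    """Render {name: value} as Prometheus text exposition format (gauges)."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return ("\n".join(lines) + "\n").encode("utf-8")  # body must end with a newline


def push(metrics, job="drift_detection", gateway="http://localhost:9091"):
    """Send the metrics to Pushgateway and return the HTTP status code."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=build_payload(metrics),
        method="PUT",  # PUT replaces every metric previously pushed for this job
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, `push({"drift_detected": 1, "drift_p_value": 0.003, "drift_distance": 0.21})` would publish three gauges under the assumed job name.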
---
## Alerting
Alert rules are defined in `prometheus/alert_rules.yml`:
- **High Latency**: Triggered when average latency exceeds 2 seconds
- **High Error Rate**: Triggered when error rate exceeds 5%
- **Data Drift Detected**: Triggered when `drift_detected` equals 1
Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.
---
## Complete Stack Usage
### Starting All Services
```bash
# Start all monitoring services
docker compose up -d
# Verify all containers are running
docker compose ps
# Check Prometheus targets (JSON from the HTTP API; /targets serves the web UI page)
curl http://localhost:9090/api/v1/targets
# Check Grafana health
curl http://localhost:3000/api/health
```
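To check scrape health programmatically rather than by eye, the JSON returned by the Prometheus `/api/v1/targets` endpoint can be filtered for targets that are not `up`. A small sketch, assuming Prometheus listens on `http://localhost:9090`; the function names are illustrative.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return the scrape URLs of active targets whose health is not 'up'."""
    return [
        target["scrapeUrl"]
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Download the current target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

With the stack running, `unhealthy_targets(fetch_targets())` returning an empty list indicates that every configured target is being scraped successfully.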
### Running Drift Detection Workflow
1. **Prepare Baseline (One-time setup)**
   ```bash
   cd monitoring/drift/scripts
   python prepare_baseline.py
   ```
2. **Execute Drift Check**
   ```bash
   python run_drift_check.py
   ```
3. **Verify Results**
   - Check Pushgateway: `http://localhost:9091`
   - Check Prometheus: `http://localhost:9090/graph`
   - Check Grafana: `http://localhost:3000`