	# Metrics Collection & Verification

	This directory contains the configuration for Prometheus monitoring.

	## Configuration
	- Prometheus Config: `prometheus/prometheus.yml`
	- Scrape Target: `hopcroft-api:8080`
	- Metrics Endpoint: `http://localhost:8080/metrics`

	## Verification Queries (PromQL)

	You can run these queries in the Prometheus Expression Browser (`http://localhost:9090/graph`):

	### 1. Request Rate (Counter)
	Shows the rate of requests per second over the last minute.
	```promql
	rate(hopcroft_requests_total[1m])
	```

	### 2. Average Request Duration (Histogram)
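	Calculates the average request latency over the last 5 minutes by dividing the rate of the duration sum by the rate of the request count.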
	```promql
	rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
	```
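The ratio in this query can be illustrated with plain arithmetic. The following is a minimal sketch, not Prometheus's actual implementation: it ignores counter resets and extrapolation, and the function name and sample values are hypothetical. Because both rates cover the same window, the window length cancels out and the result is the increase of the sum divided by the increase of the count.

```python
def average_latency(sum_start, sum_end, count_start, count_end):
    """Approximate rate(sum) / rate(count) from two samples of each series.

    Both rates share the same time window, so the window length cancels
    and only the increases matter. Returns None when no requests were
    observed in the window (PromQL would yield NaN in that case).
    """
    observed_requests = count_end - count_start
    if observed_requests <= 0:
        return None
    return (sum_end - sum_start) / observed_requests


# Hypothetical samples: 20 requests took 6 seconds in total during the window.
print(average_latency(10.0, 16.0, 100, 120))
```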

	### 3. Current In-Progress Requests (Gauge)
	Shows how many requests are currently being processed.
	```promql
	hopcroft_in_progress_requests
	```

	### 4. Model Prediction Time (Summary)
	Shows the 90th percentile of model prediction time.
	```promql
	hopcroft_prediction_processing_seconds{quantile="0.9"}
	```

	---

	## Uptime Monitoring (Better Stack)

	We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.

	Base URL
	- https://dacrow13-hopcroft-skill-classification.hf.space

	Monitored endpoints
	- https://dacrow13-hopcroft-skill-classification.hf.space/health
	- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
	- https://dacrow13-hopcroft-skill-classification.hf.space/docs
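For a quick manual check of the same endpoints without Better Stack, a small script along these lines could be used. It is only a sketch using the Python standard library: the function names are hypothetical, and treating HTTP 200 as "healthy" is an assumption, not a description of how Better Stack evaluates the monitors.

```python
import urllib.error
import urllib.request

BASE_URL = "https://dacrow13-hopcroft-skill-classification.hf.space"
PATHS = ["/health", "/openapi.json", "/docs"]


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def check_endpoints(base_url=BASE_URL, paths=PATHS, fetch=fetch_status):
    """Map each monitored path to True when it answers with HTTP 200 (assumed healthy)."""
    return {path: fetch(base_url + path) == 200 for path in paths}
```

Calling `check_endpoints()` performs real HTTP requests against the Space; passing a custom `fetch` function allows the logic to be exercised offline.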

	Checks and alerts
	- Monitors are configured to run from multiple locations.
	- Email notifications are enabled for failures.
	- A failure scenario was tested to confirm Better Stack reports the server error details.
	- Screenshots are available in `monitoring/screenshots/`.

	## Prometheus on Hugging Face Space

	Prometheus is also running directly on the Hugging Face Space and is accessible at:
	- https://dacrow13-hopcroft-skill-classification.hf.space/prometheus/

	---

	## Grafana Dashboard

	Grafana provides real-time visualization of system metrics and drift detection status.

	### Configuration
	- Port: `3000`
	- Credentials: `admin` / `admin`
	- Dashboard: Hopcroft Monitoring Dashboard
	- Datasource: Prometheus (auto-provisioned)
	- Provisioning Files:
	  - Datasources: `grafana/provisioning/datasources/prometheus.yml`
	  - Dashboards: `grafana/provisioning/dashboards/dashboard.yml`
	  - Dashboard JSON: `grafana/dashboards/hopcroft_dashboard.json`

	### Dashboard Panels
	1. API Request Rate: Rate of incoming requests per endpoint
	2. API Latency: Average response time per endpoint
	3. Drift Detection Status: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
	4. Drift P-Value: Statistical significance of detected drift
	5. Drift Distance: Kolmogorov-Smirnov distance metric

	### Access
	Navigate to `http://localhost:3000` and log in with the credentials above. The dashboard refreshes every 10 seconds.

	---

	## Data Drift Detection

	Drift detection runs statistical tests on the model input data to detect distribution shift automatically.

	### Algorithm
	- Method: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
	- Baseline Data: 1000 samples from training set
	- Detection Threshold: p-value < 0.05 (with Bonferroni correction)
	- Metrics Published: `drift_detected`, `drift_p_value`, `drift_distance`, `drift_check_timestamp`
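The thresholding step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes the Bonferroni correction divides the 0.05 significance level by the number of tested features, and the function name, return keys other than `drift_detected`, and the example feature names are hypothetical.

```python
def detect_drift(p_values, alpha=0.05):
    """Apply a Bonferroni-corrected threshold to per-feature p-values.

    p_values maps feature name -> p-value of that feature's KS test.
    Assumption: the corrected threshold is alpha / number_of_features.
    """
    if not p_values:
        raise ValueError("at least one feature p-value is required")
    threshold = alpha / len(p_values)
    drifted = sorted(name for name, p in p_values.items() if p < threshold)
    return {
        "drift_detected": int(bool(drifted)),  # 0 = no drift, 1 = drift
        "threshold": threshold,
        "drifted_features": drifted,
    }


# Hypothetical p-values for two features: the threshold becomes 0.025.
print(detect_drift({"feature_a": 0.01, "feature_b": 0.5}))
```

With the correction, a p-value of 0.04 on one of two features would not count as drift, even though it is below the uncorrected 0.05 level.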

	### Scripts

	#### Baseline Preparation
	Script: `drift/scripts/prepare_baseline.py`

	Functionality:
	- Loads data from SQLite database (`data/raw/skillscope_data.db`)
	- Extracts numeric features only
	- Samples 1000 representative records
	- Saves to `drift/baseline/reference_data.pkl`

	Usage:
	```bash
	cd monitoring/drift/scripts
	python prepare_baseline.py
	```
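The baseline preparation steps listed above could look roughly like this. It is a standard-library sketch rather than the contents of `prepare_baseline.py`: the table name passed in is hypothetical (the source does not name it), "numeric" is assumed to mean columns whose values are all integers or floats, and the fixed random seed is an illustrative choice.

```python
import pickle
import random
import sqlite3


def build_baseline(conn, table, sample_size=1000, seed=42):
    """Sample rows from a table and keep only the numeric columns.

    `table` is a hypothetical table name. Columns containing NULLs or
    non-numeric values are dropped.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    numeric = [
        index
        for index in range(len(columns))
        if rows and all(isinstance(row[index], (int, float)) for row in rows)
    ]
    # Draw a reproducible random sample, capped at the number of rows.
    sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
    return {columns[i]: [row[i] for row in sample] for i in numeric}


def save_baseline(baseline, path):
    """Persist the baseline as a pickle file (e.g. reference_data.pkl)."""
    with open(path, "wb") as handle:
        pickle.dump(baseline, handle)
```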

	#### Drift Detection
	Script: `drift/scripts/run_drift_check.py`

	Functionality:
	- Loads baseline reference data
	- Compares with new production data
	- Performs KS test on each feature
	- Pushes metrics to Pushgateway
	- Saves results to `drift/reports/`

	Usage:
	```bash
	cd monitoring/drift/scripts
	python run_drift_check.py
	```
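The per-feature comparison performed by the drift check rests on the KS distance, the largest gap between the empirical distribution functions of the baseline and the new data. The sketch below computes only that distance in pure Python to make the idea concrete; it is not the scipy-based test the script uses and it does not compute a p-value. The function name is hypothetical.

```python
from bisect import bisect_right


def ks_distance(reference, current):
    """Two-sample KS statistic: max |ECDF_reference(x) - ECDF_current(x)|.

    Both ECDFs are step functions that only change at observed values,
    so it is enough to evaluate them at every observed data point.
    """
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    distance = 0.0
    for x in set(ref) | set(cur):
        ecdf_ref = bisect_right(ref, x) / len(ref)  # share of ref values <= x
        ecdf_cur = bisect_right(cur, x) / len(cur)  # share of cur values <= x
        distance = max(distance, abs(ecdf_ref - ecdf_cur))
    return distance
```

A distance of 0 means the two samples have identical empirical distributions, while 1 means they do not overlap at all.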

	### Verification
	Check Pushgateway metrics:
	```bash
	curl http://localhost:9091/metrics | grep drift
	```

	Query in Prometheus:
	```promql
	drift_detected
	drift_p_value
	drift_distance
	```

	---

	## Pushgateway

	Pushgateway collects metrics from short-lived jobs such as the drift detection script.

	### Configuration
	- Port: `9091`
	- Persistence: Enabled with 5-minute intervals
	- Data Volume: `pushgateway-data`

	### Metrics Endpoint
	Access metrics at `http://localhost:9091/metrics`

	### Integration
	The drift detection script pushes its metrics to Pushgateway. Prometheus then scrapes Pushgateway, and Grafana displays the resulting series.
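A push of this kind is an HTTP request to Pushgateway's `/metrics/job/<job_name>` path with a body in the Prometheus text exposition format. The sketch below builds such a request with the standard library. It is an assumption-laden illustration: the job name `drift_detection`, the function name, and the example values are hypothetical, and the actual script may well use a client library instead.

```python
import urllib.request


def build_push_request(gateway_url, job, metrics):
    """Build a PUT request that replaces the metrics of one Pushgateway job.

    metrics maps metric name -> numeric value; every metric is declared
    as a gauge. The body must end with a newline.
    """
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    body = ("\n".join(lines) + "\n").encode("utf-8")
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical job name and values; nothing is sent until urlopen is called.
request = build_push_request(
    "http://localhost:9091",
    "drift_detection",
    {"drift_detected": 1, "drift_p_value": 0.03},
)
print(request.full_url)
```

Sending the request would be a matter of calling `urllib.request.urlopen(request)` while the Pushgateway container is running.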

	---

	## Alerting

	Alert rules are defined in `prometheus/alert_rules.yml`:

	- High Latency: Triggered when average latency exceeds 2 seconds
	- High Error Rate: Triggered when error rate exceeds 5%
	- Data Drift Detected: Triggered when `drift_detected` equals 1
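The three conditions can be restated as a small function to make the thresholds explicit. This is only a sketch of the logic described above, not the contents of `alert_rules.yml`: the function name and argument names are hypothetical, and "exceeds" is read as a strict comparison.

```python
def firing_alerts(avg_latency_seconds, error_rate, drift_detected):
    """Return the names of the alerts whose condition currently holds.

    error_rate is a fraction (0.05 means 5%); drift_detected is 0 or 1.
    """
    alerts = []
    if avg_latency_seconds > 2.0:
        alerts.append("High Latency")
    if error_rate > 0.05:
        alerts.append("High Error Rate")
    if drift_detected == 1:
        alerts.append("Data Drift Detected")
    return alerts


# Hypothetical readings: slow responses, few errors, no drift.
print(firing_alerts(2.5, 0.01, 0))
```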

	Alerts are routed to Alertmanager (`http://localhost:9093`) and can be configured to send notifications via email, Slack, or other channels in `alertmanager/config.yml`.

	---

	## Complete Stack Usage

	### Starting All Services
	```bash
	# Start all monitoring services
	docker compose up -d

	# Verify all containers are running
	docker compose ps

	# Check Prometheus targets
	curl http://localhost:9090/api/v1/targets

	# Check Grafana health
	curl http://localhost:3000/api/health
	```

	### Running Drift Detection Workflow

	1. Prepare Baseline (One-time setup)
	```bash
	cd monitoring/drift/scripts
	python prepare_baseline.py
	```

	2. Execute Drift Check
	```bash
	python run_drift_check.py
	```

	3. Verify Results
	- Check Pushgateway: `http://localhost:9091`
	- Check Prometheus: `http://localhost:9090/graph`
	- Check Grafana: `http://localhost:3000`