Space: DaCrow13/Hopcroft-Skill-Classification

DaCrow13 committed on Dec 21, 2025

Commit 1708201 (unverified) · 2 parents: 9d3d6ec, ac56b96

Merge pull request #31 from se4ai2526-uniba/Milestone-6-alerting

Milestone 6: Implementation of Alerting System & Deployment Docs

Files changed (8)

README.md +41 -0
docker-compose.yml +14 -0
monitoring/alertmanager/config.yml +21 -0
monitoring/prometheus/alert_rules.yml +32 -0
monitoring/prometheus/prometheus.yml +10 -0
reports/alerting_test_report/alerting_report.md +31 -0
reports/alerting_test_report/alertmanager_firing.png +3 -0
reports/alerting_test_report/prometheus_firing.png +3 -0

README.md CHANGED

@@ -478,6 +478,47 @@ docker-compose down
 ```
 ## Demo UI (Streamlit)
 The Streamlit GUI provides an interactive web interface for the skill classification API.

 ```
+--------
+## Hugging Face Spaces Deployment
+This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.
+### 1. Setup Space
+1. Create a new Space on Hugging Face.
+2. Select **Docker** as the SDK.
+3. Choose the **Blank** template or upload your code.
+### 2. Configure Secrets
+To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:
+| Name | Type | Description |
+|------|------|-------------|
+| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
+| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |
+> [!IMPORTANT]
+> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.
+### 3. Automated Startup
+The deployment follows this automated flow:
+1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
+2. **scripts/start_space.sh**:
+   - Configures DVC with your secrets.
+   - Pulls models from the DagsHub remote.
+   - Starts the **FastAPI** backend (port 8000).
+   - Starts the **Streamlit** frontend (port 8501).
+   - Starts **Nginx** (port 7860) as a reverse proxy to route traffic.
+### 4. Direct Access
+Once deployed, your Space will be available at:
+`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`
+The API documentation will be accessible at:
+`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
+--------
 ## Demo UI (Streamlit)
 The Streamlit GUI provides an interactive web interface for the skill classification API.
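
As a reading aid for the README's "Automated Startup" steps, here is a minimal Python sketch of the order in which `scripts/start_space.sh` is described as running things. The script itself is not part of this diff, so everything concrete below is an assumption: the DVC remote name `origin`, the `hopcroft.api:app` module path, the `app.py` Streamlit entry point, and the exact command lines are hypothetical. Only the secret names, the ports, and the ordering come from the README.

```python
REQUIRED_SECRETS = ("DAGSHUB_USERNAME", "DAGSHUB_TOKEN")


def build_startup_commands(env):
    """Return, in order, the commands implied by the README's startup flow.

    `env` is a mapping such as os.environ. Nothing is executed here; the
    function only assembles argument lists so the ordering can be inspected.
    """
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        # Fail early: without the secrets, the model pull cannot authenticate.
        raise RuntimeError("missing Space secrets: " + ", ".join(missing))

    remote = "origin"  # assumption: the DVC remote name is not given in the diff
    return [
        # 1. Configure DVC with the secrets (kept out of the tracked config).
        ["dvc", "remote", "modify", "--local", remote, "auth", "basic"],
        ["dvc", "remote", "modify", "--local", remote, "user", env["DAGSHUB_USERNAME"]],
        ["dvc", "remote", "modify", "--local", remote, "password", env["DAGSHUB_TOKEN"]],
        # 2. Pull the model files from the remote.
        ["dvc", "pull"],
        # 3. Backend on port 8000 (module path is hypothetical).
        ["uvicorn", "hopcroft.api:app", "--host", "0.0.0.0", "--port", "8000"],
        # 4. Frontend on port 8501 (entry point is hypothetical).
        ["streamlit", "run", "app.py", "--server.port", "8501"],
        # 5. Reverse proxy in the foreground; port 7860 would be set in its config.
        ["nginx", "-g", "daemon off;"],
    ]
```

Calling `build_startup_commands(os.environ)` inside the container would either return the seven steps or raise immediately when a secret is missing, which is usually easier to debug than a failed `dvc pull` further down. A real script should avoid echoing the password command to the logs.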

docker-compose.yml CHANGED

@@ -52,10 +52,24 @@ services:
     container_name: prometheus
     volumes:
       - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+      - ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
     ports:
       - "9090:9090"
     networks:
       - hopcroft-net
+    depends_on:
+      - alertmanager
+    restart: unless-stopped
+  alertmanager:
+    image: prom/alertmanager:latest
+    container_name: alertmanager
+    volumes:
+      - ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
+    ports:
+      - "9093:9093"
+    networks:
+      - hopcroft-net
     restart: unless-stopped
 networks:

monitoring/alertmanager/config.yml ADDED

@@ -0,0 +1,21 @@

+global:
+  resolve_timeout: 5m
+route:
+  group_by: ['alertname', 'severity']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'log-receiver'
+receivers:
+  - name: 'log-receiver'
+    webhook_configs:
+      - url: 'http://hopcroft-api:8080/health'
+inhibit_rules:
+  - source_match:
+      severity: 'critical'
+    target_match:
+      severity: 'warning'
+    equal: ['alertname', 'dev', 'instance']
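
To make the inhibition rule above easier to reason about, here is a small Python model of how such a rule is evaluated. It is a simplified sketch, not Alertmanager's implementation: it covers only exact-match `source_match` / `target_match` pairs and the `equal` list, and it treats a label missing on both sides as equal. The `instance` values in the sample alerts are hypothetical.

```python
def matches(alert_labels, matchers):
    """True if the alert carries every label/value pair in `matchers`."""
    return all(alert_labels.get(key) == value for key, value in matchers.items())


def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` would be muted by some alert in `firing`.

    Simplified model: a label absent from both alerts reads as the empty
    string on both sides and therefore counts as equal.
    """
    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target or not matches(source, source_match):
            continue
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


# Values taken from the config in this diff.
SOURCE_MATCH = {"severity": "critical"}
TARGET_MATCH = {"severity": "warning"}
EQUAL = ["alertname", "dev", "instance"]

# Sample alerts; the instance value is a placeholder.
service_down = {"alertname": "ServiceDown", "severity": "critical",
                "instance": "hopcroft-api:8080"}
high_errors = {"alertname": "HighErrorRate", "severity": "warning",
               "instance": "hopcroft-api:8080"}

# The alert names differ, so the `equal` check fails and nothing is muted.
print(is_inhibited(high_errors, [service_down], SOURCE_MATCH, TARGET_MATCH, EQUAL))
```

Under this model the `ServiceDown` alert does not mute `HighErrorRate`, because `alertname` is listed in `equal` and the two names differ. If the intent is for a down service to silence its own warning-level alerts, comparing on `instance` alone is one possible adjustment; that is a suggestion, not something this commit does.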

monitoring/prometheus/alert_rules.yml ADDED

@@ -0,0 +1,32 @@

+groups:
+  - name: hopcroft_alerts
+    rules:
+      - alert: ServiceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service {{ $labels.instance }} is down"
+          description: "The job {{ $labels.job }} has been down for more than 1 minute."
+      - alert: HighErrorRate
+        expr: |
+          sum by (instance) (rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
+          /
+          sum by (instance) (rate(hopcroft_requests_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate on {{ $labels.instance }}"
+          description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
+      - alert: SlowRequests
+        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Slow requests on {{ $labels.endpoint }}"
+          description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
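
The `HighErrorRate` expression divides the per-second rate of 5xx responses by the rate of all responses. The Python sketch below reproduces that arithmetic from two counter snapshots taken one window apart. It is a rough approximation under stated assumptions: it ignores counter resets and the extrapolation that `rate()` performs, and the sample counts are invented for illustration.

```python
def error_ratio(counts_start, counts_end, window_seconds):
    """Approximate sum(rate(5xx)) / sum(rate(all)) over one window.

    Both arguments map an HTTP status string to a cumulative request count.
    Returns None when there was no traffic (PromQL would produce NaN, which
    never satisfies `> 0.1`).
    """
    total_rate = 0.0
    error_rate = 0.0
    for status, end_count in counts_end.items():
        increase = end_count - counts_start.get(status, 0)
        rate = increase / window_seconds
        total_rate += rate
        # Mirrors the http_status=~"5.." matcher: three characters, leading 5.
        if len(status) == 3 and status.startswith("5"):
            error_rate += rate
    if total_rate == 0:
        return None
    return error_rate / total_rate


THRESHOLD = 0.1  # the 10% limit from the rule

# Invented snapshots five minutes (300 s) apart.
before = {"200": 1000, "500": 10}
after = {"200": 1240, "500": 70}

ratio = error_ratio(before, after, 300)
print(ratio is not None and ratio > THRESHOLD)
```

In this example 60 of the 300 requests in the window failed, so the ratio is about 0.2 and the condition holds. The rule would still need the condition to stay true for the `for: 5m` period before the alert fires.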

monitoring/prometheus/prometheus.yml CHANGED

@@ -1,5 +1,15 @@
 global:
   scrape_interval: 15s
+  evaluation_interval: 15s
+rule_files:
+  - "alert_rules.yml"
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+          - 'alertmanager:9093'
 scrape_configs:
   - job_name: 'hopcroft-api'
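
The new `evaluation_interval: 15s` interacts with each rule's `for` clause: a rule is re-evaluated every 15 seconds, and an alert stays pending until its condition has held for the full `for` duration. The following Python sketch models that state machine in a simplified form. The function name and the boolean input list are illustrative, and the model leaves out scrape timing and other details of Prometheus's rule manager.

```python
def alert_states(condition_samples, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_samples: one boolean per evaluation, True when the rule
    expression returned a result at that evaluation.
    """
    states = []
    active_since = None  # time of the first evaluation in the current streak
    for index, condition_true in enumerate(condition_samples):
        now = index * eval_interval_s
        if not condition_true:
            # Any gap resets the streak, so the `for` timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# ServiceDown: evaluated every 15 s with `for: 1m`.
print(alert_states([True] * 6, eval_interval_s=15, for_s=60))
```

With these settings the alert is pending for the evaluations at 0, 15, 30, and 45 seconds and fires at the 60-second evaluation. Because `up` only drops to 0 after a failed scrape, the wall-clock delay after stopping a container can be somewhat longer than one minute.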

reports/alerting_test_report/alerting_report.md ADDED

@@ -0,0 +1,31 @@

+# Prometheus + Alertmanager Alerting Report
+This report documents the configuration and verification of the alerting system for the Hopcroft Project.
+## 1. Alerting Rules
+The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
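+- **ServiceDown**: Triggers if a scrape target is unreachable (`up == 0`) for 1 minute.
+- **HighErrorRate**: Triggers if the share of 5xx responses, measured over a 5-minute window, stays above 10% for 5 minutes.
+- **SlowRequests**: Triggers if the 95th percentile of request latency for an endpoint stays above 2 seconds for 5 minutes.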
+## 2. Alertmanager Configuration
+The `monitoring/alertmanager/config.yml` file includes:
+- **Grouping**: Alerts are grouped by `alertname` and `severity`.
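+- **Inhibition**: A firing `critical` alert suppresses `warning` alerts that have the same `alertname`, `dev`, and `instance` label values.
+- **Receiver**: A single webhook receiver, `log-receiver`, posts notifications to `http://hopcroft-api:8080/health`.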
+## 3. Verification of "Firing" Alert
+The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute threshold to be reached.
+### Verification Evidence
+1. **Prometheus - Alert Firing**:
+The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
+![Prometheus Firing Alert](prometheus_firing.png)
+2. **Alertmanager - Notification Received**:
+The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
+![Alertmanager Firing Alert](alertmanager_firing.png)
+### Restoration
+Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.
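
The screenshots in the report could be complemented by a scripted check. Prometheus exposes active alerts at `/api/v1/alerts`, and the Python sketch below lists the names of alerts in the `firing` state. The base URL assumes the `9090:9090` port mapping from `docker-compose.yml` on a local machine, and the sample payload is hand-written to show the response shape; it is not output captured from this project.

```python
import json
from urllib.request import urlopen


def firing_alert_names(payload):
    """Extract the sorted, de-duplicated names of firing alerts."""
    if payload.get("status") != "success":
        raise ValueError("unexpected API status: %r" % payload.get("status"))
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted({
        alert["labels"]["alertname"]
        for alert in alerts
        if alert.get("state") == "firing"
    })


def fetch_alerts(base_url="http://localhost:9090"):
    """Fetch the current alerts from a running Prometheus (assumed URL)."""
    with urlopen(base_url + "/api/v1/alerts", timeout=5) as response:
        return json.load(response)


# Hand-written sample in the shape of the API response, for illustration only.
SAMPLE = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ServiceDown", "severity": "critical"},
             "state": "firing"},
            {"labels": {"alertname": "SlowRequests", "severity": "warning"},
             "state": "pending"},
        ]
    },
}

print(firing_alert_names(SAMPLE))
```

Against the running stack, `firing_alert_names(fetch_alerts())` would be expected to include `ServiceDown` about a minute after the `hopcroft-api` container is stopped, and to drop it again once the service is restored.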

reports/alerting_test_report/alertmanager_firing.png ADDED

Git LFS Details

SHA256: e1e229bae18738bef152e7401dc49c2f746f811c0dc0d8368153e97a2f50f6a0
Pointer size: 130 Bytes
Size of remote file: 46.2 kB

reports/alerting_test_report/prometheus_firing.png ADDED

Git LFS Details

SHA256: 2642504fc3d9c32e5f126f275bd4ff409cb0c832a9bb3836dd6fbd10452079d8
Pointer size: 130 Bytes
Size of remote file: 67.6 kB