Merge pull request #31 from se4ai2526-uniba/Milestone-6-alerting
Milestone 6: Implementation of Alerting System & Deployment Docs
- README.md +41 -0
- docker-compose.yml +14 -0
- monitoring/alertmanager/config.yml +21 -0
- monitoring/prometheus/alert_rules.yml +32 -0
- monitoring/prometheus/prometheus.yml +10 -0
- reports/alerting_test_report/alerting_report.md +31 -0
- reports/alerting_test_report/alertmanager_firing.png +3 -0
- reports/alerting_test_report/prometheus_firing.png +3 -0
README.md
CHANGED
|
@@ -478,6 +478,47 @@ docker-compose down
|
|
| 478 |
```
|
| 479 |
|
| 480 |
|
| 481 |
## Demo UI (Streamlit)
|
| 482 |
|
| 483 |
The Streamlit GUI provides an interactive web interface for the skill classification API.
|
|
|
|
| 478 |
```
|
| 479 |
|
| 480 |
|
| 481 |
+
--------
|
| 482 |
+
|
| 483 |
+
## Hugging Face Spaces Deployment
|
| 484 |
+
|
| 485 |
+
This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.
|
| 486 |
+
|
| 487 |
+
### 1. Setup Space
|
| 488 |
+
1. Create a new Space on Hugging Face.
|
| 489 |
+
2. Select **Docker** as the SDK.
|
| 490 |
+
3. Choose the **Blank** template or upload your code.
|
| 491 |
+
|
| 492 |
+
### 2. Configure Secrets
|
| 493 |
+
To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:
|
| 494 |
+
|
| 495 |
+
| Name | Type | Description |
|
| 496 |
+
|------|------|-------------|
|
| 497 |
+
| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
|
| 498 |
+
| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |
|
| 499 |
+
|
| 500 |
+
> [!IMPORTANT]
|
| 501 |
+
> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.
|
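A minimal preflight sketch for the secrets listed above, assuming they reach the container as ordinary environment variables. The helper name `missing_secrets` is hypothetical and is not part of `scripts/start_space.sh`; it only illustrates the kind of check that could run before a DVC pull is attempted.

```python
import os

# Secret names taken from the table above.
REQUIRED = ("DAGSHUB_USERNAME", "DAGSHUB_TOKEN")


def missing_secrets(env=os.environ):
    """Return the required secret names that are unset or empty in `env`."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        raise SystemExit("Missing Space secrets: " + ", ".join(missing))
    print("All DagsHub secrets are set.")
```

Failing early with the names of the missing secrets is usually easier to diagnose in the Space logs than an authentication error raised later by DVC.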
| 502 |
+
|
| 503 |
+
### 3. Automated Startup
|
| 504 |
+
The deployment follows this automated flow:
|
| 505 |
+
1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
|
| 506 |
+
2. **scripts/start_space.sh**:
|
| 507 |
+
- Configures DVC with your secrets.
|
| 508 |
+
- Pulls models from the DagsHub remote.
|
| 509 |
+
- Starts the **FastAPI** backend (port 8000).
|
| 510 |
+
- Starts the **Streamlit** frontend (port 8501).
|
| 511 |
+
- Starts **Nginx** (port 7860) as a reverse proxy to route traffic.
|
| 512 |
+
|
| 513 |
+
### 4. Direct Access
|
| 514 |
+
Once deployed, your Space will be available at:
|
| 515 |
+
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`
|
| 516 |
+
|
| 517 |
+
The API documentation will be accessible at:
|
| 518 |
+
`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
|
| 519 |
+
|
| 520 |
+
--------
|
| 521 |
+
|
| 522 |
## Demo UI (Streamlit)
|
| 523 |
|
| 524 |
The Streamlit GUI provides an interactive web interface for the skill classification API.
|
docker-compose.yml
CHANGED
|
@@ -52,10 +52,24 @@ services:
|
|
| 52 |
container_name: prometheus
|
| 53 |
volumes:
|
| 54 |
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
|
|
|
|
| 55 |
ports:
|
| 56 |
- "9090:9090"
|
| 57 |
networks:
|
| 58 |
- hopcroft-net
|
| 59 |
restart: unless-stopped
|
| 60 |
|
| 61 |
networks:
|
|
|
|
| 52 |
container_name: prometheus
|
| 53 |
volumes:
|
| 54 |
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
|
| 55 |
+
- ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
|
| 56 |
ports:
|
| 57 |
- "9090:9090"
|
| 58 |
networks:
|
| 59 |
- hopcroft-net
|
| 60 |
+
depends_on:
|
| 61 |
+
- alertmanager
|
| 62 |
+
restart: unless-stopped
|
| 63 |
+
|
| 64 |
+
alertmanager:
|
| 65 |
+
image: prom/alertmanager:latest
|
| 66 |
+
container_name: alertmanager
|
| 67 |
+
volumes:
|
| 68 |
+
- ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
|
| 69 |
+
ports:
|
| 70 |
+
- "9093:9093"
|
| 71 |
+
networks:
|
| 72 |
+
- hopcroft-net
|
| 73 |
restart: unless-stopped
|
| 74 |
|
| 75 |
networks:
|
monitoring/alertmanager/config.yml
ADDED
|
@@ -0,0 +1,21 @@
|
|
| 1 |
+
global:
|
| 2 |
+
resolve_timeout: 5m
|
| 3 |
+
|
| 4 |
+
route:
|
| 5 |
+
group_by: ['alertname', 'severity']
|
| 6 |
+
group_wait: 10s
|
| 7 |
+
group_interval: 10s
|
| 8 |
+
repeat_interval: 1h
|
| 9 |
+
receiver: 'log-receiver'
|
| 10 |
+
|
| 11 |
+
receivers:
|
| 12 |
+
- name: 'log-receiver'
|
| 13 |
+
webhook_configs:
|
| 14 |
+
- url: 'http://hopcroft-api:8080/health'
|
| 15 |
+
|
| 16 |
+
inhibit_rules:
|
| 17 |
+
- source_match:
|
| 18 |
+
severity: 'critical'
|
| 19 |
+
target_match:
|
| 20 |
+
severity: 'warning'
|
| 21 |
+
equal: ['alertname', 'dev', 'instance']
|
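The inhibition block in this config can be read as a predicate over two alerts. The sketch below is a simplified model of that predicate, not Alertmanager's implementation. The label values for `instance` and `endpoint` are hypothetical examples. It relies on the documented behaviour that a missing label and an empty label are treated as the same value.

```python
from typing import Mapping, Sequence

Labels = Mapping[str, str]


def matches(labels: Labels, matcher: Labels) -> bool:
    """True when every matcher label is present with exactly that value."""
    return all(labels.get(name) == value for name, value in matcher.items())


def inhibits(source: Labels, target: Labels, source_match: Labels,
             target_match: Labels, equal: Sequence[str]) -> bool:
    """Decide whether a firing `source` alert mutes `target` under one rule."""
    if not matches(source, source_match) or not matches(target, target_match):
        return False
    # A missing label counts as an empty value, so two alerts that both
    # lack a label (for example `dev`) still agree on it.
    return all(source.get(name, "") == target.get(name, "") for name in equal)


# Values copied from the config above.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

# Hypothetical alerts; the instance and endpoint values are made up.
service_down = {"alertname": "ServiceDown", "severity": "critical",
                "instance": "hopcroft-api:8080"}
slow_requests = {"alertname": "SlowRequests", "severity": "warning",
                 "endpoint": "/predict"}

print(inhibits(service_down, slow_requests, **RULE))
```

Because `alertname` is part of `equal`, a critical `ServiceDown` alert does not mute a warning with a different alert name under this model. Removing `alertname` from the list would be one way to widen the suppression, if that is the intent.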
monitoring/prometheus/alert_rules.yml
ADDED
|
@@ -0,0 +1,32 @@
|
|
| 1 |
+
groups:
|
| 2 |
+
- name: hopcroft_alerts
|
| 3 |
+
rules:
|
| 4 |
+
- alert: ServiceDown
|
| 5 |
+
expr: up == 0
|
| 6 |
+
for: 1m
|
| 7 |
+
labels:
|
| 8 |
+
severity: critical
|
| 9 |
+
annotations:
|
| 10 |
+
summary: "Service {{ $labels.instance }} is down"
|
| 11 |
+
description: "The job {{ $labels.job }} has been down for more than 1 minute."
|
| 12 |
+
|
| 13 |
+
- alert: HighErrorRate
|
| 14 |
+
expr: |
|
| 15 |
+
sum(rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
|
| 16 |
+
/
|
| 17 |
+
sum(rate(hopcroft_requests_total[5m])) > 0.1
|
| 18 |
+
for: 5m
|
| 19 |
+
labels:
|
| 20 |
+
severity: warning
|
| 21 |
+
annotations:
|
| 22 |
+
summary: "High error rate on {{ $labels.instance }}"
|
| 23 |
+
description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
|
| 24 |
+
|
| 25 |
+
- alert: SlowRequests
|
| 26 |
+
expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
|
| 27 |
+
for: 5m
|
| 28 |
+
labels:
|
| 29 |
+
severity: warning
|
| 30 |
+
annotations:
|
| 31 |
+
summary: "Slow requests on {{ $labels.endpoint }}"
|
| 32 |
+
description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
|
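To make the interaction between the threshold and the `for:` clause concrete, here is a small model of the `HighErrorRate` rule. It is a sketch under simplifying assumptions: the class `AlertState` and the function `error_ratio` are hypothetical names, the rates are passed in directly rather than computed from counters, and evaluation times are plain seconds.

```python
from dataclasses import dataclass
from typing import Optional


def error_ratio(errors_per_s: float, requests_per_s: float) -> Optional[float]:
    """5xx request rate divided by the total request rate."""
    if requests_per_s == 0:
        # With no traffic the PromQL division yields NaN, which never
        # satisfies "> 0.1"; None stands in for that case here.
        return None
    return errors_per_s / requests_per_s


@dataclass
class AlertState:
    threshold: float                      # 0.1 in the rule above
    hold_seconds: float                   # "for: 5m" is 300 seconds
    active_since: Optional[float] = None  # when the condition first held

    def evaluate(self, now: float, value: Optional[float]) -> str:
        """Return "inactive", "pending" or "firing" for one evaluation."""
        if value is None or not value > self.threshold:
            self.active_since = None      # condition broken, timer resets
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


state = AlertState(threshold=0.1, hold_seconds=300)
ratio = error_ratio(errors_per_s=2.0, requests_per_s=10.0)

# One evaluation every 15 seconds, matching evaluation_interval.
timeline = [state.evaluate(t, ratio) for t in range(0, 301, 15)]
print(timeline[0], timeline[-1])
```

The alert stays pending while the ratio remains above the threshold and only fires once the condition has held for the full hold period. A single evaluation below the threshold resets the timer.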
monitoring/prometheus/prometheus.yml
CHANGED
|
@@ -1,5 +1,15 @@
|
|
| 1 |
global:
|
| 2 |
scrape_interval: 15s
|
| 3 |
|
| 4 |
scrape_configs:
|
| 5 |
- job_name: 'hopcroft-api'
|
|
|
|
| 1 |
global:
|
| 2 |
scrape_interval: 15s
|
| 3 |
+
evaluation_interval: 15s
|
| 4 |
+
|
| 5 |
+
rule_files:
|
| 6 |
+
- "alert_rules.yml"
|
| 7 |
+
|
| 8 |
+
alerting:
|
| 9 |
+
alertmanagers:
|
| 10 |
+
- static_configs:
|
| 11 |
+
- targets:
|
| 12 |
+
- 'alertmanager:9093'
|
| 13 |
|
| 14 |
scrape_configs:
|
| 15 |
- job_name: 'hopcroft-api'
|
reports/alerting_test_report/alerting_report.md
ADDED
|
@@ -0,0 +1,31 @@
|
|
| 1 |
+
# Prometheus + Alertmanager Alerting Report
|
| 2 |
+
|
| 3 |
+
This report documents the configuration and verification of the alerting system for the Hopcroft Project.
|
| 4 |
+
|
| 5 |
+
## 1. Alerting Rules
|
| 6 |
+
The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
|
| 7 |
+
- **ServiceDown**: Triggers if a service is unreachable for 1 minute.
|
| 8 |
+
- **HighErrorRate**: Triggers if 5xx responses exceed 10% of all requests, measured over a 5-minute window and sustained for 5 minutes.
|
| 9 |
+
- **SlowRequests**: Triggers if the 95th percentile of request latency exceeds 2 seconds.
|
| 10 |
+
|
| 11 |
+
## 2. Alertmanager Configuration
|
| 12 |
+
The `monitoring/alertmanager/config.yml` file includes:
|
| 13 |
+
- **Grouping**: Alerts are grouped by `alertname` and `severity`.
|
| 14 |
+
- **Inhibition**: Critical alerts suppress warning-level alerts that share the same `alertname`, `dev`, and `instance` label values.
|
| 15 |
+
- **Receiver**: A webhook receiver is configured to forward notifications.
|
| 16 |
+
|
| 17 |
+
## 3. Verification of "Firing" Alert
|
| 18 |
+
The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute threshold to be reached.
|
| 19 |
+
|
| 20 |
+
### Verification Proofs:
|
| 21 |
+
|
| 22 |
+
1. **Prometheus - Alert Firing**:
|
| 23 |
+
The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
|
| 24 |
+

|
| 25 |
+
|
| 26 |
+
2. **Alertmanager - Notification Received**:
|
| 27 |
+
The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
|
| 28 |
+

|
| 29 |
+
|
| 30 |
+
### Restoration
|
| 31 |
+
Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.
|
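The screenshot-based check described in this report could also be scripted against the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at `http://localhost:9090` (the port published in `docker-compose.yml`); the helper names are hypothetical, and the sample payload is hand-written in the shape of an `/api/v1/alerts` response, not captured output.

```python
import json
from urllib.request import urlopen


def firing_alerts(payload: dict, alertname: str) -> list:
    """Return the alerts named `alertname` whose state is "firing"."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [alert for alert in alerts
            if alert.get("state") == "firing"
            and alert.get("labels", {}).get("alertname") == alertname]


def fetch_alerts(base_url: str = "http://localhost:9090") -> dict:
    """Read the active alerts from Prometheus (needs the stack running)."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=5) as response:
        return json.load(response)


# Hand-written sample payload; label values are illustrative only.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ServiceDown", "severity": "critical",
                        "job": "hopcroft-api"},
             "state": "firing"},
            {"labels": {"alertname": "SlowRequests", "severity": "warning"},
             "state": "pending"},
        ]
    },
}

print(len(firing_alerts(sample, "ServiceDown")))
```

With the stack running and the `hopcroft-api` container stopped for more than a minute, `firing_alerts(fetch_alerts(), "ServiceDown")` would be expected to return a non-empty list.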
reports/alerting_test_report/alertmanager_firing.png
ADDED
|
Git LFS Details
|
reports/alerting_test_report/prometheus_firing.png
ADDED
|
Git LFS Details
|