Commit 1708201 by DaCrow13 · unverified · 2 parents: 9d3d6ec ac56b96

Merge pull request #31 from se4ai2526-uniba/Milestone-6-alerting

Milestone 6: Implementation of Alerting System & Deployment Docs

README.md CHANGED
@@ -478,6 +478,47 @@ docker-compose down
  ```
 
 
+--------
+
+## Hugging Face Spaces Deployment
+
+This project is configured to run on [Hugging Face Spaces](https://huggingface.co/spaces) using Docker.
+
+### 1. Setup Space
+1. Create a new Space on Hugging Face.
+2. Select **Docker** as the SDK.
+3. Choose the **Blank** template or upload your code.
+
+### 2. Configure Secrets
+To enable the application to pull models from DagsHub via DVC, you must configure the following **Variables and Secrets** in your Space settings:
+
+| Name | Type | Description |
+|------|------|-------------|
+| `DAGSHUB_USERNAME` | Secret | Your DagsHub username. |
+| `DAGSHUB_TOKEN` | Secret | Your DagsHub access token (Settings -> Tokens). |
+
+> [!IMPORTANT]
+> These secrets are injected into the container at runtime. The `scripts/start_space.sh` script uses them to authenticate DVC and pull the required model files (`.pkl`) before starting the API and GUI.
+
+### 3. Automated Startup
+The deployment follows this automated flow:
+1. **Dockerfile**: Builds the environment, installs dependencies, and sets up Nginx.
+2. **scripts/start_space.sh**:
+   - Configures DVC with your secrets.
+   - Pulls models from the DagsHub remote.
+   - Starts the **FastAPI** backend (port 8000).
+   - Starts the **Streamlit** frontend (port 8501).
+   - Starts **Nginx** (port 7860) as a reverse proxy to route traffic.
+
+### 4. Direct Access
+Once deployed, your Space will be available at:
+`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft`
+
+The API documentation will be accessible at:
+`https://huggingface.co/spaces/se4ai2526-uniba/Hopcroft/docs`
+
+--------
+
 ## Demo UI (Streamlit)
 
 The Streamlit GUI provides an interactive web interface for the skill classification API.
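The startup order listed under "Automated Startup" in the README hunk can be sketched as a dry run that only builds the command lines. This is a minimal sketch, not the contents of `scripts/start_space.sh`: the DVC remote name `origin`, the module path `app.main:app`, and the script path `gui/app.py` are hypothetical placeholders, and the real script may background the API and GUI differently.

```python
import shlex


def build_startup_commands(env, remote="origin"):
    """Return the startup commands in the order the README describes.

    `remote`, the uvicorn module path, and the Streamlit script path are
    hypothetical placeholders, not values taken from the repository.
    """
    user = env["DAGSHUB_USERNAME"]  # raises KeyError if the secret is missing
    token = env["DAGSHUB_TOKEN"]
    return [
        # 1. Configure DVC with the Space secrets; --local keeps them out of
        #    the committed DVC config.
        ["dvc", "remote", "modify", remote, "--local", "auth", "basic"],
        ["dvc", "remote", "modify", remote, "--local", "user", user],
        ["dvc", "remote", "modify", remote, "--local", "password", token],
        # 2. Pull the model files from the DagsHub remote.
        ["dvc", "pull"],
        # 3. FastAPI backend on port 8000.
        ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"],
        # 4. Streamlit frontend on port 8501.
        ["streamlit", "run", "gui/app.py", "--server.port", "8501"],
        # 5. Nginx in the foreground; its listen port (7860) lives in the
        #    Nginx config file, not on the command line.
        ["nginx", "-g", "daemon off;"],
    ]


if __name__ == "__main__":
    demo_env = {"DAGSHUB_USERNAME": "alice", "DAGSHUB_TOKEN": "<token>"}
    for cmd in build_startup_commands(demo_env):
        print(shlex.join(cmd))
```

Failing fast with a `KeyError` when a secret is absent mirrors the README's point that both secrets must be set in the Space settings before the models can be pulled.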
docker-compose.yml CHANGED
@@ -52,10 +52,24 @@ services:
     container_name: prometheus
     volumes:
       - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+      - ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
     ports:
       - "9090:9090"
     networks:
       - hopcroft-net
+    depends_on:
+      - alertmanager
+    restart: unless-stopped
+
+  alertmanager:
+    image: prom/alertmanager:latest
+    container_name: alertmanager
+    volumes:
+      - ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
+    ports:
+      - "9093:9093"
+    networks:
+      - hopcroft-net
     restart: unless-stopped
 
 networks:
monitoring/alertmanager/config.yml ADDED
@@ -0,0 +1,21 @@
+global:
+  resolve_timeout: 5m
+
+route:
+  group_by: ['alertname', 'severity']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'log-receiver'
+
+receivers:
+  - name: 'log-receiver'
+    webhook_configs:
+      - url: 'http://hopcroft-api:8080/health'
+
+inhibit_rules:
+  - source_match:
+      severity: 'critical'
+    target_match:
+      severity: 'warning'
+    equal: ['alertname', 'dev', 'instance']
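How the inhibition rule in this file selects alerts to mute can be sketched as follows. This is a simplified model of Alertmanager's matching, assuming alerts are plain label dictionaries and treating a missing label as an empty value; the function name `is_inhibited` is illustrative and not part of any library.

```python
def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by some alert in `firing` under `rule`.

    Alerts are dicts of labels. A missing label is treated like an empty
    value, so two alerts that both lack a label count as equal on it.
    """
    def matches(alert, matchers):
        return all(alert.get(key, "") == value for key, value in matchers.items())

    # Only alerts matching target_match can be muted.
    if not matches(target, rule["target_match"]):
        return False

    for source in firing:
        if source is target:
            continue
        same_labels = all(
            source.get(label, "") == target.get(label, "")
            for label in rule["equal"]
        )
        if matches(source, rule["source_match"]) and same_labels:
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

if __name__ == "__main__":
    down = {"alertname": "ServiceDown", "severity": "critical", "instance": "api:8080"}
    slow = {"alertname": "SlowRequests", "severity": "warning", "instance": "api:8080"}
    # Different alertname, so the critical alert does not mute the warning.
    print(is_inhibited(slow, [down], RULE))
```

Because `alertname` is in the `equal` list, a critical alert only mutes a warning that carries the same alert name, so under this model a critical `ServiceDown` does not silence a warning such as `SlowRequests`.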
monitoring/prometheus/alert_rules.yml ADDED
@@ -0,0 +1,32 @@
+groups:
+  - name: hopcroft_alerts
+    rules:
+      - alert: ServiceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service {{ $labels.instance }} is down"
+          description: "The job {{ $labels.job }} has been down for more than 1 minute."
+
+      - alert: HighErrorRate
+        expr: |
+          sum(rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
+          /
+          sum(rate(hopcroft_requests_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate on {{ $labels.instance }}"
+          description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
+
+      - alert: SlowRequests
+        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Slow requests on {{ $labels.endpoint }}"
+          description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
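The interplay between a rule's `expr` threshold and its `for` duration can be illustrated with a small model. This is a sketch under simplifying assumptions: rates are passed in as plain numbers rather than computed from counters, and the function names are illustrative. It follows the documented behaviour that an alert stays pending until its condition has held continuously for the `for` duration.

```python
def error_ratio_exceeded(errors_per_s, total_per_s, threshold=0.1):
    """Mirror the HighErrorRate condition: 5xx rate / total rate > threshold."""
    if total_per_s == 0:
        # In PromQL 0/0 is NaN, and NaN > 0.1 is false, so nothing fires.
        return False
    return errors_per_s / total_per_s > threshold


def alert_state(samples, for_seconds):
    """Return 'inactive', 'pending', or 'firing' after the last evaluation.

    `samples` is a time-ordered list of (timestamp_seconds, condition_true).
    """
    active_since = None
    state = "inactive"
    for timestamp, condition in samples:
        if not condition:
            # Any false evaluation resets the pending timer.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= for_seconds else "pending"
    return state


if __name__ == "__main__":
    # Evaluations every 15 s (the evaluation_interval) with 12% errors.
    evaluations = [(t, error_ratio_exceeded(12.0, 100.0)) for t in range(0, 301, 15)]
    print(alert_state(evaluations, for_seconds=300))
```

With a 15-second evaluation interval, a condition that first becomes true at `t = 0` reaches the firing state at the evaluation 300 seconds later, provided no evaluation in between sees it false.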
monitoring/prometheus/prometheus.yml CHANGED
@@ -1,5 +1,15 @@
 global:
   scrape_interval: 15s
+  evaluation_interval: 15s
+
+rule_files:
+  - "alert_rules.yml"
+
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+            - 'alertmanager:9093'
 
 scrape_configs:
   - job_name: 'hopcroft-api'
reports/alerting_test_report/alerting_report.md ADDED
@@ -0,0 +1,31 @@
+# Prometheus + Alertmanager Alerting Report
+
+This report documents the configuration and verification of the alerting system for the Hopcroft Project.
+
+## 1. Alerting Rules
+The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
+- **ServiceDown**: Triggers if a service is unreachable for 1 minute.
+- **HighErrorRate**: Triggers if more than 10% of requests return a 5xx status for 5 minutes.
+- **SlowRequests**: Triggers if the 95th percentile of request latency exceeds 2 seconds for 5 minutes.
+
+## 2. Alertmanager Configuration
+The `monitoring/alertmanager/config.yml` file includes:
+- **Grouping**: Alerts are grouped by `alertname` and `severity`.
+- **Inhibition**: A critical alert suppresses warning-level alerts that share its `alertname`, `dev`, and `instance` labels.
+- **Receiver**: A webhook receiver (`log-receiver`) is configured to forward notifications.
+
+## 3. Verification of "Firing" Alert
+The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute threshold to be reached.
+
+### Verification Proofs:
+
+1. **Prometheus - Alert Firing**:
+   The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
+   ![Prometheus Firing Alert](prometheus_firing.png)
+
+2. **Alertmanager - Notification Received**:
+   The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
+   ![Alertmanager Firing Alert](alertmanager_firing.png)
+
+### Restoration
+Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.
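The manual check described in the report could also be scripted against the Prometheus HTTP API, which lists active alerts at `/api/v1/alerts`. The sketch below is not part of the commit: it assumes Prometheus is reachable at `http://localhost:9090` (the port published in `docker-compose.yml`), and the helper names are illustrative.

```python
import json
import urllib.request


def fetch_alerts(base_url="http://localhost:9090"):
    """Fetch the active alerts from a running Prometheus (assumed URL)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


def firing_alerts(payload, alertname):
    """Return the alerts named `alertname` that are in the 'firing' state."""
    if payload.get("status") != "success":
        return []
    return [
        alert
        for alert in payload["data"]["alerts"]
        if alert["state"] == "firing"
        and alert["labels"].get("alertname") == alertname
    ]


if __name__ == "__main__":
    # Requires the monitoring stack to be running with the API stopped.
    found = firing_alerts(fetch_alerts(), "ServiceDown")
    print(f"{len(found)} ServiceDown alert(s) firing")
```

Separating the network call from the filtering keeps the check easy to exercise with a canned response when the monitoring stack is not running.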
reports/alerting_test_report/alertmanager_firing.png ADDED

Git LFS Details

  • SHA256: e1e229bae18738bef152e7401dc49c2f746f811c0dc0d8368153e97a2f50f6a0
  • Pointer size: 130 Bytes
  • Size of remote file: 46.2 kB
reports/alerting_test_report/prometheus_firing.png ADDED

Git LFS Details

  • SHA256: 2642504fc3d9c32e5f126f275bd4ff409cb0c832a9bb3836dd6fbd10452079d8
  • Pointer size: 130 Bytes
  • Size of remote file: 67.6 kB