DaCrow13 committed on
Commit c2c8d10 · 1 Parent(s): 4a91a86

feat: Add Prometheus and Alertmanager services with alert rules, configurations, and a verification report.
docker-compose.yml CHANGED
@@ -52,10 +52,24 @@ services:
     container_name: prometheus
     volumes:
       - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+      - ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
     ports:
       - "9090:9090"
     networks:
       - hopcroft-net
+    depends_on:
+      - alertmanager
+    restart: unless-stopped
+
+  alertmanager:
+    image: prom/alertmanager:latest
+    container_name: alertmanager
+    volumes:
+      - ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
+    ports:
+      - "9093:9093"
+    networks:
+      - hopcroft-net
     restart: unless-stopped
 
 networks:
monitoring/alertmanager/config.yml ADDED
@@ -0,0 +1,21 @@
+global:
+  resolve_timeout: 5m
+
+route:
+  group_by: ['alertname', 'severity']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'log-receiver'
+
+receivers:
+  - name: 'log-receiver'
+    webhook_configs:
+      - url: 'http://hopcroft-api:8080/health'
+
+inhibit_rules:
+  - source_match:
+      severity: 'critical'
+    target_match:
+      severity: 'warning'
+    equal: ['alertname', 'dev', 'instance']
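The `group_by: ['alertname', 'severity']` setting above means Alertmanager sends one notification per (alertname, severity) pair rather than one per alert instance. A minimal sketch of that bucketing, using hypothetical alert payloads (not Alertmanager's actual internals):

```python
from collections import defaultdict

# Hypothetical alerts as Prometheus might deliver them to Alertmanager.
alerts = [
    {"alertname": "ServiceDown", "severity": "critical", "instance": "hopcroft-api:8080"},
    {"alertname": "ServiceDown", "severity": "critical", "instance": "hopcroft-api:8081"},
    {"alertname": "HighErrorRate", "severity": "warning", "instance": "hopcroft-api:8080"},
]

# group_by: ['alertname', 'severity'] -> one notification group per key pair.
groups = defaultdict(list)
for alert in alerts:
    groups[(alert["alertname"], alert["severity"])].append(alert)

print(len(groups))  # 2 groups: (ServiceDown, critical) and (HighErrorRate, warning)
```

Both `ServiceDown` instances collapse into a single notification, which is what keeps a multi-instance outage from flooding the receiver.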
monitoring/prometheus/alert_rules.yml ADDED
@@ -0,0 +1,32 @@
+groups:
+  - name: hopcroft_alerts
+    rules:
+      - alert: ServiceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service {{ $labels.instance }} is down"
+          description: "The job {{ $labels.job }} has been down for more than 1 minute."
+
+      - alert: HighErrorRate
+        expr: |
+          sum(rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
+          /
+          sum(rate(hopcroft_requests_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate on {{ $labels.instance }}"
+          description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
+
+      - alert: SlowRequests
+        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Slow requests on {{ $labels.endpoint }}"
+          description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
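The `HighErrorRate` expression divides the rate of 5xx responses by the total request rate and fires above 0.1. The thresholding logic can be sketched in plain Python (hypothetical per-status rates, not actual PromQL evaluation):

```python
def high_error_rate(rates_by_status: dict, threshold: float = 0.1) -> bool:
    """Mirror the PromQL check: sum of 5xx rates over total rate > threshold."""
    total = sum(rates_by_status.values())
    if total == 0:
        return False  # no traffic: PromQL would produce no sample, so no alert
    errors = sum(r for status, r in rates_by_status.items() if status.startswith("5"))
    return errors / total > threshold

print(high_error_rate({"200": 9.0, "500": 0.5}))  # 0.5 / 9.5 ≈ 0.053 -> False
print(high_error_rate({"200": 8.0, "503": 2.0}))  # 2.0 / 10.0 = 0.2  -> True
```

Note that the real rule additionally requires the condition to hold for 5 minutes (`for: 5m`) before the alert transitions from pending to firing.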
monitoring/prometheus/prometheus.yml CHANGED
@@ -1,5 +1,15 @@
 global:
   scrape_interval: 15s
+  evaluation_interval: 15s
+
+rule_files:
+  - "alert_rules.yml"
+
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+            - 'alertmanager:9093'
 
 scrape_configs:
   - job_name: 'hopcroft-api'
reports/alerting_test_report/alerting_report.md ADDED
@@ -0,0 +1,31 @@
+# Prometheus + Alertmanager Alerting Report
+
+This report documents the configuration and verification of the alerting system for the Hopcroft Project.
+
+## 1. Alerting Rules
+The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
+- **ServiceDown**: Triggers if a service is unreachable for 1 minute.
+- **HighErrorRate**: Triggers if the error rate exceeds 10% over a 5-minute window.
+- **SlowRequests**: Triggers if the 95th percentile of request latency exceeds 2 seconds.
+
+## 2. Alertmanager Configuration
+The `monitoring/alertmanager/config.yml` file includes:
+- **Grouping**: Alerts are grouped by `alertname` and `severity`.
+- **Inhibition**: Critical alerts suppress warning-level alerts.
+- **Receiver**: A webhook receiver is configured to forward notifications.
+
+## 3. Verification of "Firing" Alert
+The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute threshold to be reached.
+
+### Verification Proofs:
+
+1. **Prometheus - Alert Firing**:
+   The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
+   ![Prometheus Firing Alert](prometheus_firing.png)
+
+2. **Alertmanager - Notification Received**:
+   The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
+   ![Alertmanager Firing Alert](alertmanager_firing.png)
+
+### Restoration
+Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.
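A check like the dashboard screenshots above can also be scripted against Prometheus's `/api/v1/alerts` HTTP endpoint. The sketch below parses that endpoint's JSON response; the sample payload is illustrative, not captured from the actual test run:

```python
import json

def firing_alerts(api_response: str) -> list:
    """Return names of alerts in the 'firing' state from a /api/v1/alerts payload."""
    data = json.loads(api_response)
    return [
        alert["labels"]["alertname"]
        for alert in data.get("data", {}).get("alerts", [])
        if alert.get("state") == "firing"
    ]

# Illustrative payload mimicking the ServiceDown test described above.
sample = '''{"status": "success", "data": {"alerts": [
  {"labels": {"alertname": "ServiceDown", "severity": "critical"}, "state": "firing"},
  {"labels": {"alertname": "HighErrorRate", "severity": "warning"}, "state": "pending"}
]}}'''

print(firing_alerts(sample))  # ['ServiceDown']
```

In a live setup the payload would come from fetching `http://localhost:9090/api/v1/alerts`, so the same function could confirm the `ServiceDown` firing state without opening the dashboard.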