Commit c2c8d10, committed by DaCrow13
Parent: 4a91a86
feat: Add Prometheus and Alertmanager services with alert rules, configurations, and a verification report.
docker-compose.yml CHANGED
@@ -52,10 +52,24 @@ services:
     container_name: prometheus
     volumes:
       - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+      - ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
     ports:
       - "9090:9090"
     networks:
       - hopcroft-net
+    depends_on:
+      - alertmanager
+    restart: unless-stopped
+
+  alertmanager:
+    image: prom/alertmanager:latest
+    container_name: alertmanager
+    volumes:
+      - ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
+    ports:
+      - "9093:9093"
+    networks:
+      - hopcroft-net
     restart: unless-stopped
 
 networks:
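Review note on the `alertmanager` service above: as far as I know, the `prom/alertmanager` image starts with `--config.file=/etc/alertmanager/alertmanager.yml` by default, so a file mounted at `/etc/alertmanager/config.yml` may never be read unless the path is passed explicitly. The `command:` entries below are not part of this commit; they are a minimal sketch under that assumption.

```yaml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
    # Assumption: the image default points at /etc/alertmanager/alertmanager.yml,
    # so the mounted path is passed explicitly. Overriding `command` replaces the
    # image's default arguments, hence the storage path is repeated here.
    command:
      - "--config.file=/etc/alertmanager/config.yml"
      - "--storage.path=/alertmanager"
    ports:
      - "9093:9093"
    networks:
      - hopcroft-net
    restart: unless-stopped
```

An alternative with the same effect would be to keep the default command and mount the file as `/etc/alertmanager/alertmanager.yml` instead.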
monitoring/alertmanager/config.yml ADDED
@@ -0,0 +1,21 @@
+global:
+  resolve_timeout: 5m
+
+route:
+  group_by: ['alertname', 'severity']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'log-receiver'
+
+receivers:
+  - name: 'log-receiver'
+    webhook_configs:
+      - url: 'http://hopcroft-api:8080/health'
+
+inhibition_rules:
+  - source_match:
+      severity: 'critical'
+    target_match:
+      severity: 'warning'
+    equal: ['alertname', 'dev', 'instance']
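Review note on the Alertmanager config above: the Alertmanager configuration reference names the top-level inhibition key `inhibit_rules`, so `inhibition_rules` as committed is likely to be rejected or ignored. Separately, Alertmanager delivers webhook notifications as HTTP POST requests with a JSON body, which a health-check endpoint may not accept. The sketch below shows those two sections with both points addressed. The `/alerts` path is hypothetical (no such endpoint appears in this commit), and dropping `alertname` from `equal` is an assumption about the intent stated in the report.

```yaml
receivers:
  - name: 'log-receiver'
    webhook_configs:
      # Hypothetical endpoint: it would need to accept POSTed JSON from Alertmanager.
      - url: 'http://hopcroft-api:8080/alerts'

# Key name as given in the Alertmanager configuration reference.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Assumption: with 'alertname' in this list, a critical ServiceDown alert
    # would only inhibit warnings that are also named ServiceDown, so it is
    # omitted here to let ServiceDown suppress warnings on the same instance.
    equal: ['instance']
```

Note that `instance` only matches if the warning alert still carries that label, which the aggregated `HighErrorRate` and `SlowRequests` expressions in this commit do not preserve.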
monitoring/prometheus/alert_rules.yml ADDED
@@ -0,0 +1,32 @@
+groups:
+  - name: hopcroft_alerts
+    rules:
+      - alert: ServiceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service {{ $labels.instance }} is down"
+          description: "The job {{ $labels.job }} has been down for more than 1 minute."
+
+      - alert: HighErrorRate
+        expr: |
+          sum(rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
+          /
+          sum(rate(hopcroft_requests_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate on {{ $labels.instance }}"
+          description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
+
+      - alert: SlowRequests
+        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Slow requests on {{ $labels.endpoint }}"
+          description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
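Review note on `HighErrorRate` above: a bare `sum(...)` aggregates away every label, so `{{ $labels.instance }}` in the summary would render empty. One possible variant, not part of this commit, keeps the `instance` label on both sides of the division. It assumes `hopcroft_requests_total` is scraped per instance, so the label is attached by Prometheus at scrape time.

```yaml
      - alert: HighErrorRate
        # Aggregating by instance on both operands keeps the label available
        # to the annotations and yields one error ratio per instance.
        expr: |
          sum by (instance) (rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
          /
          sum by (instance) (rate(hopcroft_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
```

The trade-off is that the alert then fires per instance rather than once for the whole service. If a single service-wide alert is preferred, the original expression can stay and the summary can drop the `instance` reference instead.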
monitoring/prometheus/prometheus.yml CHANGED
@@ -1,5 +1,15 @@
 global:
   scrape_interval: 15s
+  evaluation_interval: 15s
+
+rule_files:
+  - "alert_rules.yml"
+
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+            - 'alertmanager:9093'
 
 scrape_configs:
   - job_name: 'hopcroft-api'
reports/alerting_test_report/alerting_report.md ADDED
@@ -0,0 +1,31 @@
+# Prometheus + Alertmanager Alerting Report
+
+This report documents the configuration and verification of the alerting system for the Hopcroft Project.
+
+## 1. Alerting Rules
+The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
+- **ServiceDown**: Triggers if a service is unreachable for 1 minute.
+- **HighErrorRate**: Triggers if the error rate exceeds 10%.
+- **SlowRequests**: Triggers if the 95th percentile of request latency exceeds 2 seconds.
+
+## 2. Alertmanager Configuration
+The `monitoring/alertmanager/config.yml` file includes:
+- **Grouping**: Alerts are grouped by `alertname` and `severity`.
+- **Inhibition**: Critical alerts suppress warning-level alerts.
+- **Receiver**: A webhook receiver is configured to forward notifications.
+
+## 3. Verification of "Firing" Alert
+The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute threshold to be reached.
+
+### Verification Proofs:
+
+1. **Prometheus - Alert Firing**:
+   The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
+   
+
+2. **Alertmanager - Notification Received**:
+   The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
+   
+
+### Restoration
+Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.