Hopcroft-Skill-Classification

DaCrow13 committed on Dec 21, 2025

Commit

c2c8d10

1 Parent(s): 4a91a86

feat: Add Prometheus and Alertmanager services with alert rules, configurations, and a verification report.

Files changed (5)

docker-compose.yml +14 -0
monitoring/alertmanager/config.yml +21 -0
monitoring/prometheus/alert_rules.yml +32 -0
monitoring/prometheus/prometheus.yml +10 -0
reports/alerting_test_report/alerting_report.md +31 -0

docker-compose.yml CHANGED

@@ -52,10 +52,24 @@ services:
     container_name: prometheus
     volumes:
       - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+      - ./monitoring/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
     ports:
       - "9090:9090"
     networks:
       - hopcroft-net
+    depends_on:
+      - alertmanager
+    restart: unless-stopped
+  alertmanager:
+    image: prom/alertmanager:latest
+    container_name: alertmanager
+    volumes:
+      - ./monitoring/alertmanager/config.yml:/etc/alertmanager/config.yml
+    ports:
+      - "9093:9093"
+    networks:
+      - hopcroft-net
     restart: unless-stopped
 networks:

monitoring/alertmanager/config.yml ADDED

@@ -0,0 +1,21 @@

+global:
+  resolve_timeout: 5m
+route:
+  group_by: ['alertname', 'severity']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'log-receiver'
+receivers:
+  - name: 'log-receiver'
+    webhook_configs:
+      - url: 'http://hopcroft-api:8080/health'
+inhibit_rules:
+  - source_match:
+      severity: 'critical'
+    target_match:
+      severity: 'warning'
+    equal: ['alertname', 'dev', 'instance']
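
To make the `source_match` / `target_match` / `equal` block above concrete, here is a minimal Python sketch of how such a rule decides whether a warning is muted. It is a simplification for illustration, not Alertmanager's implementation: alerts are modelled as plain label dictionaries, `matches` and `is_inhibited` are hypothetical helpers, and the sample label values (such as `hopcroft-api:8080`) are assumptions. A missing label is treated as an empty string, so a label absent on both sides counts as equal.

```python
# Values copied from the rule in config.yml.
SOURCE_MATCH = {"severity": "critical"}
TARGET_MATCH = {"severity": "warning"}
EQUAL = ["alertname", "dev", "instance"]


def matches(labels, matcher):
    """True if every matcher key has exactly the given value in labels."""
    return all(labels.get(name, "") == value for name, value in matcher.items())


def is_inhibited(target, firing_alerts):
    """Return True if any firing source alert mutes the target alert."""
    if not matches(target, TARGET_MATCH):
        return False
    for source in firing_alerts:
        if not matches(source, SOURCE_MATCH):
            continue
        # Missing labels compare as "", so absent-on-both counts as equal.
        if all(source.get(name, "") == target.get(name, "") for name in EQUAL):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical label sets; the instance value is an assumption.
    down = {"alertname": "ServiceDown", "severity": "critical",
            "instance": "hopcroft-api:8080"}
    errors = {"alertname": "HighErrorRate", "severity": "warning",
              "instance": "hopcroft-api:8080"}
    same_name = {"alertname": "ServiceDown", "severity": "warning",
                 "instance": "hopcroft-api:8080"}
    print(is_inhibited(errors, [down]))     # alertname differs
    print(is_inhibited(same_name, [down]))  # all EQUAL labels agree
```

Because `alertname` is listed in `equal`, a critical `ServiceDown` does not mute a warning `HighErrorRate` in this sketch. Only a warning that carries the same `alertname`, `dev`, and `instance` values would be muted.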

monitoring/prometheus/alert_rules.yml ADDED

@@ -0,0 +1,32 @@

+groups:
+  - name: hopcroft_alerts
+    rules:
+      - alert: ServiceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service {{ $labels.instance }} is down"
+          description: "The job {{ $labels.job }} has been down for more than 1 minute."
+      - alert: HighErrorRate
+        expr: |
+          sum by (instance) (rate(hopcroft_requests_total{http_status=~"5.."}[5m]))
+          /
+          sum by (instance) (rate(hopcroft_requests_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate on {{ $labels.instance }}"
+          description: "Error rate is above 10% for the last 5 minutes (current value: {{ $value | printf \"%.2f\" }})."
+      - alert: SlowRequests
+        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(hopcroft_request_duration_seconds_bucket[5m]))) > 2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Slow requests on {{ $labels.endpoint }}"
+          description: "95th percentile of request latency is above 2s (current value: {{ $value | printf \"%.2f\" }}s)."
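
Each rule above pairs a condition (`expr`) with a hold time (`for`). The following Python sketch illustrates how that combination moves an alert through inactive, pending, and firing states. It is a simplified model, not Prometheus code: `PendingThenFiring` and `service_down_timeline` are hypothetical names, and the 15-second step is assumed to match the `evaluation_interval` set in `prometheus.yml`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingThenFiring:
    """Tracks one alert through inactive -> pending -> firing."""
    hold_seconds: float                   # the rule's `for:` duration
    active_since: Optional[float] = None  # first evaluation where expr held

    def evaluate(self, now: float, condition_holds: bool) -> str:
        if not condition_holds:
            self.active_since = None      # any miss resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


def service_down_timeline(up_samples, interval=15.0, hold=60.0):
    """Evaluate `up == 0` with `for: 1m` once per interval."""
    alert = PendingThenFiring(hold_seconds=hold)
    return [alert.evaluate(i * interval, up == 0)
            for i, up in enumerate(up_samples)]


if __name__ == "__main__":
    # One healthy sample, five failed scrapes, then recovery.
    print(service_down_timeline([1, 0, 0, 0, 0, 0, 1]))
```

In this model the alert stays pending for four evaluations and fires on the fifth consecutive failure, 60 seconds after the condition first held. A single healthy sample returns it to inactive.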

monitoring/prometheus/prometheus.yml CHANGED

@@ -1,5 +1,15 @@
 global:
   scrape_interval: 15s
+  evaluation_interval: 15s
+rule_files:
+  - "alert_rules.yml"
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+          - 'alertmanager:9093'
 scrape_configs:
   - job_name: 'hopcroft-api'

reports/alerting_test_report/alerting_report.md ADDED

@@ -0,0 +1,31 @@

+# Prometheus + Alertmanager Alerting Report
+This report documents the configuration and verification of the alerting system for the Hopcroft Project.
+## 1. Alerting Rules
+The `monitoring/prometheus/alert_rules.yml` file is configured with the following rules:
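+- **ServiceDown** (critical): Fires when a scrape target has reported `up == 0` for 1 minute.
+- **HighErrorRate** (warning): Fires when 5xx responses exceed 10% of all requests, measured over a 5-minute rate window and sustained for 5 minutes.
+- **SlowRequests** (warning): Fires when the 95th-percentile request latency of an endpoint stays above 2 seconds for 5 minutes.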
+## 2. Alertmanager Configuration
+The `monitoring/alertmanager/config.yml` file includes:
+- **Grouping**: Alerts are grouped by `alertname` and `severity`.
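+- **Inhibition**: A critical alert suppresses warning-level alerts that share its `alertname`, `dev`, and `instance` label values.
+- **Receiver**: A webhook receiver named `log-receiver` posts notifications to `http://hopcroft-api:8080/health`.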
+## 3. Verification of "Firing" Alert
+The test was conducted by stopping the `hopcroft-api` container and waiting for the 1-minute `for` threshold of `ServiceDown` to elapse.
+### Verification Evidence
+1. **Prometheus - Alert Firing**:
+The following image shows the `ServiceDown` alert in the **FIRING** state within the Prometheus dashboard.
+![Prometheus Firing Alert](prometheus_firing.png)
+2. **Alertmanager - Notification Received**:
+The following image shows the Alertmanager interface with the alert correctly received from Prometheus.
+![Alertmanager Firing Alert](alertmanager_firing.png)
+### Restoration
+Following verification, the `hopcroft-api` service was restarted and monitored until it returned to a healthy state.
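
The screenshot-based check in the report could also be scripted. Prometheus lists active alerts at its `/api/v1/alerts` HTTP endpoint, and the sketch below filters that response for a firing `ServiceDown`. The base URL `http://localhost:9090` is an assumption based on the `9090:9090` port mapping in `docker-compose.yml`; `fetch_alerts` and `firing` are hypothetical helper names, and the sample payload is illustrative rather than captured output.

```python
import json
from urllib.request import urlopen


def fetch_alerts(base_url="http://localhost:9090"):
    """Fetch the active-alert list from a running Prometheus server."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


def firing(payload, alertname):
    """Return the alerts in the payload that are firing under the given name."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        a for a in alerts
        if a.get("labels", {}).get("alertname") == alertname
        and a.get("state") == "firing"
    ]


if __name__ == "__main__":
    # Illustrative payload in the shape of the /api/v1/alerts response.
    sample = {
        "status": "success",
        "data": {"alerts": [
            {"labels": {"alertname": "ServiceDown", "severity": "critical"},
             "state": "firing"},
            {"labels": {"alertname": "HighErrorRate", "severity": "warning"},
             "state": "pending"},
        ]},
    }
    print(len(firing(sample, "ServiceDown")))
    # Against a live stack: firing(fetch_alerts(), "ServiceDown")
```

Keeping the filter separate from the network call lets the check run against saved responses as well as a live server.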