---
title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---

# 🧠 Agentic Reliability Framework

**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**

## 🚀 Live Demo

**Try it now!** Enter system telemetry data and watch specialized AI agents analyze the event, diagnose the cause, and recommend healing actions in real time.

## 🎯 What It Does

This framework transforms traditional monitoring into **autonomous reliability engineering**:

- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- **💰 Business Impact**: Real-time revenue and user impact calculations
- **📚 Learning System**: FAISS-powered memory learns from every incident
- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features

## 🛠️ Quick Start

### 1. Select a Service
Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`

### 2. Adjust Metrics
- **Latency P99**: Alert threshold >150ms (adaptive)
- **Error Rate**: Alert threshold >0.05 (5%)
- **Throughput**: Current requests per second
- **CPU/Memory**: Utilization (0.0-1.0 scale)

### 3. Submit & Analyze
Click **"Submit Telemetry Event"** to see AI agents in action!
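
The fields above can be pictured as a single event record. The sketch below is only an illustration: the `TelemetryEvent` class, its field names, and the validation rules are assumptions made for this README, not the app's actual schema. It uses the service list and the 0.0-1.0 utilization scale described above.

```python
from dataclasses import dataclass

# Service names offered in the Quick Start list above.
KNOWN_SERVICES = {
    "api-service", "auth-service", "payment-service", "database", "cache-service",
}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one telemetry submission (not the app's real class)."""
    component: str
    latency_p99_ms: float
    error_rate: float       # fraction, 0.05 means 5%
    throughput_rps: float   # requests per second
    cpu_util: float         # 0.0-1.0 scale
    memory_util: float      # 0.0-1.0 scale

    def __post_init__(self):
        if self.component not in KNOWN_SERVICES:
            raise ValueError(f"unknown service: {self.component!r}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


# The throughput value here is an arbitrary illustration.
event = TelemetryEvent("api-service", 800.0, 0.25, 1000.0, 0.95, 0.90)
```

Validating at construction time means out-of-range values (for example an error rate entered as `25` instead of `0.25`) fail early instead of reaching the agents.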

## 📊 Example Test Cases

### 🚨 Critical Failure
- **Component**: `api-service`
- **Latency**: 800ms
- **Error Rate**: 0.25
- **CPU**: 0.95
- **Memory**: 0.90

*Expected: CRITICAL severity, circuit_breaker + scale_out actions*

### ⚠️ Performance Issue
- **Component**: `auth-service`
- **Latency**: 350ms
- **Error Rate**: 0.08
- **CPU**: 0.75
- **Memory**: 0.65

*Expected: HIGH severity, traffic_shift action*

### ✅ Normal Operation
- **Component**: `payment-service`
- **Latency**: 120ms
- **Error Rate**: 0.02
- **CPU**: 0.45
- **Memory**: 0.35

*Expected: NORMAL status, no actions needed*
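
To make "policy-based" remediation concrete, here is a minimal sketch of a rule table that maps metrics to actions. The `Policy` type, the `recommend_actions` helper, and the rule conditions are all hypothetical: the conditions were chosen only so that the three examples above produce the listed actions, and the framework's real policies may differ.

```python
from typing import Callable, Dict, List, NamedTuple


class Policy(NamedTuple):
    action: str
    applies: Callable[[Dict[str, float]], bool]


# Hypothetical rules (assumptions for illustration, not the app's real policies).
POLICIES = [
    Policy("circuit_breaker", lambda m: m["error_rate"] >= 0.15),
    Policy("scale_out", lambda m: m["cpu"] > 0.9 or m["memory"] > 0.9),
    Policy("traffic_shift", lambda m: m["latency_ms"] > 300 and m["error_rate"] < 0.15),
]


def recommend_actions(metrics: Dict[str, float]) -> List[str]:
    """Return the actions of every policy whose condition matches, in table order."""
    return [policy.action for policy in POLICIES if policy.applies(metrics)]


critical = {"latency_ms": 800, "error_rate": 0.25, "cpu": 0.95, "memory": 0.90}
degraded = {"latency_ms": 350, "error_rate": 0.08, "cpu": 0.75, "memory": 0.65}
normal = {"latency_ms": 120, "error_rate": 0.02, "cpu": 0.45, "memory": 0.35}

for name, metrics in [("critical", critical), ("degraded", degraded), ("normal", normal)]:
    print(name, recommend_actions(metrics))
```

Keeping the rules in a plain list makes it easy to add or reorder policies without touching the evaluation logic.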

## 🔧 Technical Features

### Multi-Agent Architecture
- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- **🔍 Diagnostician Agent**: Root cause analysis & investigation
- **🤖 Orchestration Manager**: Coordinates all agents in parallel
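
One way to picture "coordinates all agents in parallel" is with `asyncio.gather`. This is a sketch under assumptions: the agent functions, their return shapes, and the placeholder logic inside them are invented for illustration and are not the framework's actual agents.

```python
import asyncio
from typing import Dict


async def detective_agent(event: Dict[str, float]) -> dict:
    # Placeholder logic: flag an anomaly above the 150 ms latency warning threshold.
    await asyncio.sleep(0)  # stands in for real async work (model call, I/O)
    return {"agent": "detective", "anomaly": event["latency_ms"] > 150}


async def diagnostician_agent(event: Dict[str, float]) -> dict:
    # Placeholder logic: guess a cause from CPU utilization alone.
    await asyncio.sleep(0)
    cause = "resource_exhaustion" if event["cpu"] > 0.9 else "unknown"
    return {"agent": "diagnostician", "suspected_cause": cause}


async def orchestrate(event: Dict[str, float]) -> dict:
    """Run both agents concurrently and merge their results by agent name."""
    results = await asyncio.gather(detective_agent(event), diagnostician_agent(event))
    return {result["agent"]: result for result in results}


report = asyncio.run(orchestrate({"latency_ms": 800, "cpu": 0.95}))
print(report)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the merge step does not depend on which agent finishes first.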

### Smart Detection
- Adaptive thresholds that learn from your environment
- Multi-dimensional anomaly scoring (0-100% confidence)
- Correlation analysis across metrics
- FAISS vector memory for incident similarity
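
As a rough illustration of an adaptive threshold, the sketch below keeps a rolling window of recent normal values and alerts when a new value exceeds the window mean plus `k` standard deviations. The `AdaptiveThreshold` class, the window size, and the mean-plus-`k`-sigma rule are assumptions; this README does not specify how the framework adapts its thresholds.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Hypothetical rolling threshold: mean + k * stddev of recent normal samples."""

    def __init__(self, initial: float, window: int = 50, k: float = 3.0, min_samples: int = 10):
        self.initial = initial          # static fallback, e.g. 150 ms for latency
        self.k = k
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)

    def current(self) -> float:
        # Fall back to the static threshold until enough history exists.
        if len(self.samples) < self.min_samples:
            return self.initial
        return mean(self.samples) + self.k * pstdev(self.samples)

    def observe(self, value: float) -> bool:
        """Return True if value breaches the current threshold; learn only from normal values."""
        breached = value > self.current()
        if not breached:
            self.samples.append(value)
        return breached


latency = AdaptiveThreshold(initial=150.0, min_samples=5)
for sample in [100, 102, 98, 101, 99]:
    latency.observe(sample)
print(latency.current(), latency.observe(140))
```

Excluding breaching values from the window keeps a sustained incident from raising the threshold. The trade-off is that a legitimate shift in baseline is never learned without a separate reset.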

### Business Intelligence
- Real-time revenue impact calculations
- User impact estimation  
- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)

## 🎮 Try These Scenarios

### Test 1: Resource Exhaustion
Set CPU to 0.95 and Memory to 0.95, then watch `scale_out` actions trigger.

### Test 2: High Latency + Errors  
Set Latency to 500ms and Error Rate to 0.15 to see the circuit breaker activate.

### Test 3: Gradual Degradation
Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds respond.

## 🚨 Default Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |
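
The table can be read as a per-metric lookup. The sketch below encodes it directly. The dictionary keys and function names are hypothetical, and it classifies each metric on its own: how the framework combines per-metric levels into an overall severity (LOW, MEDIUM, HIGH, CRITICAL) is not specified here.

```python
from typing import Dict

# (warning, critical) bounds copied from the table above; a value must exceed the bound.
THRESHOLDS = {
    "latency_p99_ms": (150, 300),
    "error_rate": (0.05, 0.15),
    "cpu": (0.8, 0.9),
    "memory": (0.8, 0.9),
}


def classify_metric(name: str, value: float) -> str:
    """Return 'critical', 'warning', or 'normal' for a single metric."""
    warning, critical = THRESHOLDS[name]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "normal"


def classify_all(metrics: Dict[str, float]) -> Dict[str, str]:
    return {name: classify_metric(name, value) for name, value in metrics.items()}


levels = classify_all({"latency_p99_ms": 350, "error_rate": 0.08, "cpu": 0.75, "memory": 0.65})
print(levels)
```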

## 🔮 Roadmap

- [ ] Predictive anomaly detection
- [ ] Multi-cloud coordination  
- [ ] Advanced root cause analysis
- [ ] Automated runbook execution
- [ ] Team learning and knowledge transfer

## 💡 Why This Matters

> "The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.

## 🛠️ Technical Stack

- **Backend**: Python, FastAPI, Sentence Transformers
- **AI/ML**: FAISS, Hugging Face, Custom Agents
- **Frontend**: Gradio
- **Storage**: FAISS vector database, JSON metadata

---

**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**

*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*