petter2025 committed
Commit 540525a · verified · 1 Parent(s): 8a5d251

Update README.md

Files changed (1): README.md +130 -1
@@ -9,4 +9,133 @@ app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---

# 🧠 Agentic Reliability Framework

**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**

## 🚀 Live Demo

**Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real time.

## 🎯 What It Does

This framework transforms traditional monitoring into **autonomous reliability engineering**:

- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- **💰 Business Impact**: Real-time revenue and user impact calculations
- **📚 Learning System**: FAISS-powered memory learns from every incident
- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features

## 🛠️ Quick Start

### 1. Select a Service
Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`

### 2. Adjust Metrics
- **Latency P99**: Alert threshold >150ms (adaptive)
- **Error Rate**: Alert threshold >0.05 (5%)
- **Throughput**: Current requests per second
- **CPU/Memory**: Utilization (0.0-1.0 scale)

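As a rough illustration of the inputs listed above, one telemetry event could be modeled as a small dataclass that rejects out-of-range values. The class and field names (`TelemetryEvent`, `latency_p99_ms`, `throughput_rps`, and so on) are assumptions made for this sketch, not the app's actual schema, and the throughput value in the example is arbitrary.

```python
from dataclasses import dataclass

# Service names taken from the "Select a Service" step above.
KNOWN_SERVICES = {"api-service", "auth-service", "payment-service", "database", "cache-service"}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one telemetry submission."""

    component: str
    latency_p99_ms: float
    error_rate: float      # fraction of failed requests, 0.0-1.0
    throughput_rps: float  # requests per second
    cpu_util: float        # 0.0-1.0
    memory_util: float     # 0.0-1.0

    def __post_init__(self) -> None:
        if self.component not in KNOWN_SERVICES:
            raise ValueError(f"unknown service: {self.component!r}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


# Example event; the throughput figure is an arbitrary placeholder.
event = TelemetryEvent("api-service", 800.0, 0.25, 1200.0, 0.95, 0.90)
```

Validating at construction time means a mistyped utilization such as `95` instead of `0.95` fails immediately instead of reaching the agents.
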
### 3. Submit & Analyze
Click **"Submit Telemetry Event"** to see the AI agents in action!

## 📊 Example Test Cases

### 🚨 Critical Failure
- Component: `api-service`
- Latency: 800ms
- Error Rate: 0.25
- CPU: 0.95
- Memory: 0.90

*Expected: CRITICAL severity, `circuit_breaker` + `scale_out` actions*

### ⚠️ Performance Issue
- Component: `auth-service`
- Latency: 350ms
- Error Rate: 0.08
- CPU: 0.75
- Memory: 0.65

*Expected: HIGH severity, `traffic_shift` action*

### ✅ Normal Operation
- Component: `payment-service`
- Latency: 120ms
- Error Rate: 0.02
- CPU: 0.45
- Memory: 0.35

*Expected: NORMAL status, no actions needed*

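The README does not show the app's actual healing policies. The sketch below uses hypothetical rules, chosen only so that the three test cases above map to their listed actions. The function name `recommend_actions` and every cut-off in it are assumptions.

```python
def recommend_actions(latency_ms: float, error_rate: float, cpu: float, memory: float) -> list[str]:
    """Return healing actions under hypothetical rules (not the app's real policies)."""
    actions = []
    # Assumed rule: slow and error-prone traffic trips the circuit breaker.
    if latency_ms > 300 and error_rate > 0.10:
        actions.append("circuit_breaker")
    # Assumed rule: CPU or memory above 0.9 triggers scale-out.
    if cpu > 0.9 or memory > 0.9:
        actions.append("scale_out")
    # Assumed rule: a degraded service with no stronger action gets its traffic shifted.
    if not actions and (latency_ms > 150 or error_rate > 0.05):
        actions.append("traffic_shift")
    return actions


critical = recommend_actions(800, 0.25, 0.95, 0.90)     # Critical Failure inputs
performance = recommend_actions(350, 0.08, 0.75, 0.65)  # Performance Issue inputs
normal = recommend_actions(120, 0.02, 0.45, 0.35)       # Normal Operation inputs
```

Keeping each rule as a separate condition makes it easy to see which metric caused which action when comparing against the demo's output.
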
## 🔧 Technical Features

### Multi-Agent Architecture
- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- **🔍 Diagnostician Agent**: Root cause analysis & investigation
- **🤖 Orchestration Manager**: Coordinates all agents in parallel

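One way to picture the coordination described above is an orchestrator that runs agent coroutines concurrently with `asyncio.gather`. This is a minimal sketch: the agent functions, their placeholder logic, and the result keys are all assumptions, not the framework's real agents.

```python
import asyncio


async def detective_agent(event: dict) -> dict:
    """Placeholder anomaly check: flags latency above the 150 ms warning level."""
    return {"agent": "detective", "anomaly": event["latency_p99_ms"] > 150}


async def diagnostician_agent(event: dict) -> dict:
    """Placeholder diagnosis: names the first metric that looks unhealthy."""
    if event["cpu_util"] > 0.8:
        cause = "cpu_saturation"
    elif event["error_rate"] > 0.05:
        cause = "elevated_errors"
    else:
        cause = "none"
    return {"agent": "diagnostician", "likely_cause": cause}


async def orchestrate(event: dict) -> dict:
    """Run all agents concurrently and merge their findings by agent name."""
    results = await asyncio.gather(detective_agent(event), diagnostician_agent(event))
    return {r["agent"]: r for r in results}


event = {"latency_p99_ms": 800.0, "error_rate": 0.25, "cpu_util": 0.95}
report = asyncio.run(orchestrate(event))
```

Note that `asyncio.gather` gives concurrency on a single thread, which suits I/O-bound agents such as model or API calls; CPU-bound agents would need threads or processes for true parallelism.
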
### Smart Detection
- Adaptive thresholds that learn from your environment
- Multi-dimensional anomaly scoring (0-100% confidence)
- Correlation analysis across metrics
- FAISS vector memory for incident similarity

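The README does not say how the adaptive thresholds are computed. A common approach, shown here purely as an assumption, is to flag values that exceed a rolling mean by some multiple of the rolling standard deviation. The class name `AdaptiveThreshold` and its parameters (`window`, `k`, `min_samples`) are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Flags values more than `k` standard deviations above a rolling mean."""

    def __init__(self, window: int = 50, k: float = 3.0, min_samples: int = 5) -> None:
        self.values = deque(maxlen=window)  # oldest readings fall out automatically
        self.k = k
        self.min_samples = min_samples

    def threshold(self) -> float:
        """Current alert level; infinite until enough history has accumulated."""
        if len(self.values) < self.min_samples:
            return float("inf")
        return mean(self.values) + self.k * pstdev(self.values)

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous, then add it to the history."""
        is_anomaly = value > self.threshold()
        self.values.append(value)
        return is_anomaly


detector = AdaptiveThreshold(window=20, k=3.0)
baseline = [100, 105, 98, 102, 101, 99, 103]  # latency readings in ms
flags = [detector.observe(v) for v in baseline]
spike = detector.observe(400)
```

In this sketch anomalous readings still enter the history, so a sustained shift gradually becomes the new baseline; excluding flagged values would keep the threshold stricter.
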
### Business Intelligence
- Real-time revenue impact calculations
- User impact estimation
- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)

## 🎮 Try These Scenarios

### Test 1: Resource Exhaustion
Set CPU to 0.95 and Memory to 0.95, then watch `scale_out` actions trigger.

### Test 2: High Latency + Errors
Set Latency to 500ms and Error Rate to 0.15 to see the circuit breaker activate.

### Test 3: Gradual Degradation
Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds behave.

## 🚨 Default Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |

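The table above can be read as a simple lookup. The sketch below classifies each metric reading as `ok`, `warning`, or `critical` using those default limits. The dictionary keys and function names are assumptions. It reports per-metric levels only, since the README does not describe how they combine into the overall LOW/MEDIUM/HIGH/CRITICAL severity.

```python
# (warning, critical) limits copied from the table above; key names are hypothetical.
DEFAULT_THRESHOLDS = {
    "latency_p99_ms": (150, 300),
    "error_rate": (0.05, 0.15),
    "cpu_util": (0.8, 0.9),
    "memory_util": (0.8, 0.9),
}


def classify_metric(name: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a single metric reading."""
    warning, critical = DEFAULT_THRESHOLDS[name]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_event(metrics: dict) -> dict:
    """Classify every known metric in a reading; unknown keys are ignored."""
    return {k: classify_metric(k, v) for k, v in metrics.items() if k in DEFAULT_THRESHOLDS}


levels = classify_event(
    {"latency_p99_ms": 350, "error_rate": 0.08, "cpu_util": 0.75, "memory_util": 0.65}
)
```

Because the comparisons are strict (`>`), a reading exactly at a limit, such as an error rate of 0.15, stays in the lower band.
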
## 🔮 Roadmap

- [ ] Predictive anomaly detection
- [ ] Multi-cloud coordination
- [ ] Advanced root cause analysis
- [ ] Automated runbook execution
- [ ] Team learning and knowledge transfer

## 💡 Why This Matters

> "The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.

## 🛠️ Technical Stack

- **Backend**: Python, FastAPI, Sentence Transformers
- **AI/ML**: FAISS, Hugging Face, Custom Agents
- **Frontend**: Gradio
- **Storage**: FAISS vector index, JSON metadata

---

**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**

*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*