---
title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---
# 🧠 Agentic Reliability Framework
**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**
## Live Demo
**Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real time.
## 🎯 What It Does
This framework transforms traditional monitoring into **autonomous reliability engineering**:
- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- **💰 Business Impact**: Real-time revenue and user impact calculations
- **Learning System**: FAISS-powered memory learns from every incident
- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features
## 🛠️ Quick Start
### 1. Select a Service
Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`
### 2. Adjust Metrics
- **Latency P99**: Alert threshold >150ms (adaptive)
- **Error Rate**: Alert threshold >0.05 (5%)
- **Throughput**: Current requests per second
- **CPU/Memory**: Utilization (0.0-1.0 scale)
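
The inputs above could be captured in a small validated record before submission. This is a minimal sketch: the `TelemetryEvent` class and its field names are hypothetical and are not taken from the framework's code, and the throughput value in the usage line is made up for illustration.

```python
from dataclasses import dataclass

# Service names listed in step 1 of the Quick Start.
KNOWN_SERVICES = {
    "api-service", "auth-service", "payment-service", "database", "cache-service",
}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one telemetry submission."""
    component: str
    latency_p99_ms: float   # milliseconds
    error_rate: float       # fraction, 0.0-1.0
    throughput_rps: float   # requests per second
    cpu_util: float         # fraction, 0.0-1.0
    memory_util: float      # fraction, 0.0-1.0

    def __post_init__(self) -> None:
        # Reject values outside the ranges described in the Quick Start.
        if self.component not in KNOWN_SERVICES:
            raise ValueError(f"unknown component: {self.component!r}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


# Throughput of 1000 req/s is an assumed value for illustration only.
event = TelemetryEvent("api-service", 800.0, 0.25, 1000.0, 0.95, 0.90)
```

Validating at construction time keeps out-of-range values (for example an error rate entered as a percentage instead of a fraction) from reaching the agents.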
### 3. Submit & Analyze
Click **"Submit Telemetry Event"** to see AI agents in action!
## Example Test Cases
### 🚨 Critical Failure
- **Component**: `api-service`
- **Latency**: 800ms
- **Error Rate**: 0.25
- **CPU**: 0.95
- **Memory**: 0.90

*Expected: CRITICAL severity, `circuit_breaker` + `scale_out` actions*
### ⚠️ Performance Issue
- **Component**: `auth-service`
- **Latency**: 350ms
- **Error Rate**: 0.08
- **CPU**: 0.75
- **Memory**: 0.65

*Expected: HIGH severity, `traffic_shift` action*
### ✅ Normal Operation
- **Component**: `payment-service`
- **Latency**: 120ms
- **Error Rate**: 0.02
- **CPU**: 0.45
- **Memory**: 0.35

*Expected: NORMAL status, no actions needed*
## 🔧 Technical Features
### Multi-Agent Architecture
- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- **Diagnostician Agent**: Root cause analysis & investigation
- **🤖 Orchestration Manager**: Coordinates all agents in parallel
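
One way to picture the orchestration pattern is a coordinator that launches every agent concurrently and merges their findings. The sketch below uses `asyncio.gather`, which runs the coroutines concurrently on one event loop rather than on separate cores. The function names, the dictionary keys, and the placeholder agent logic are all assumptions for illustration and do not reflect the framework's actual agents.

```python
import asyncio


async def detective_agent(event: dict) -> dict:
    """Placeholder anomaly check: flags latency or error rate past the warning level."""
    anomalous = event["latency_ms"] > 150 or event["error_rate"] > 0.05
    return {"agent": "detective", "anomaly": anomalous}


async def diagnostician_agent(event: dict) -> dict:
    """Placeholder root-cause guess based on which signal looks unhealthy."""
    if event["cpu"] > 0.9 or event["memory"] > 0.9:
        cause = "resource_exhaustion"
    elif event["error_rate"] > 0.05:
        cause = "elevated_errors"
    else:
        cause = "none"
    return {"agent": "diagnostician", "suspected_cause": cause}


async def orchestrate(event: dict) -> dict:
    """Run both agents concurrently and index their results by agent name."""
    results = await asyncio.gather(
        detective_agent(event),
        diagnostician_agent(event),
    )
    return {result["agent"]: result for result in results}


report = asyncio.run(orchestrate(
    {"latency_ms": 800, "error_rate": 0.25, "cpu": 0.95, "memory": 0.90}
))
```

Because `asyncio.gather` returns results in the order the coroutines were passed, the coordinator can add agents without changing how results are merged.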
### Smart Detection
- Adaptive thresholds that learn from your environment
- Multi-dimensional anomaly scoring (0-100% confidence)
- Correlation analysis across metrics
- FAISS vector memory for incident similarity
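
To make the idea of an adaptive threshold concrete, here is a minimal sketch that uses a rolling mean plus `k` standard deviations and falls back to a static default until enough samples have been seen. This is a common technique chosen for illustration; the README does not say which method the framework uses, and the class name, window size, and `k` value are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Rolling mean + k * std threshold with a static fallback (illustrative only)."""

    def __init__(self, default: float, window: int = 50,
                 k: float = 3.0, min_samples: int = 10):
        self.default = default
        self.k = k
        self.min_samples = min_samples
        self.history = deque(maxlen=window)

    def current(self) -> float:
        # Until enough samples exist, fall back to the static default.
        if len(self.history) < self.min_samples:
            return self.default
        return mean(self.history) + self.k * pstdev(self.history)

    def observe(self, value: float) -> bool:
        """Return True if value breaches the current threshold."""
        breached = value > self.current()
        # Learn only from non-breaching samples so outliers do not inflate the baseline.
        if not breached:
            self.history.append(value)
        return breached


latency = AdaptiveThreshold(default=150.0)
for sample in [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]:
    latency.observe(sample)
```

After these ten samples the learned threshold sits a little under 106 ms, so a 120 ms reading is flagged even though it is below the static 150 ms default.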
### Business Intelligence
- Real-time revenue impact calculations
- User impact estimation
- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
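
As a rough illustration of what an impact calculation could look like, the sketch below treats every failed request as lost revenue. The formula, the function name `estimate_impact`, and all input values (throughput, revenue per request, incident duration) are assumptions made for this example; the README does not describe the framework's actual model.

```python
def estimate_impact(throughput_rps: float, error_rate: float,
                    revenue_per_request: float, duration_s: float) -> dict:
    """Estimate failed requests and lost revenue over an incident window.

    Simplifying assumption: each failed request loses its full revenue.
    """
    failed_requests = throughput_rps * error_rate * duration_s
    return {
        "failed_requests": failed_requests,
        "estimated_revenue_loss": failed_requests * revenue_per_request,
    }


# Hypothetical inputs: 1000 req/s, 25% errors, $0.02 per request, 60 seconds.
impact = estimate_impact(
    throughput_rps=1000, error_rate=0.25, revenue_per_request=0.02, duration_s=60
)
```

With these made-up inputs the sketch reports 15,000 failed requests and roughly $300 of lost revenue for one minute of degradation.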
## 🎮 Try These Scenarios
### Test 1: Resource Exhaustion
Set CPU to 0.95 and Memory to 0.95, then watch `scale_out` actions trigger.
### Test 2: High Latency + Errors
Set Latency to 500ms and Error Rate to 0.15 to see the circuit breaker activate.
### Test 3: Gradual Degradation
Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds respond.
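
The scenarios above suggest a policy layer that maps metric conditions to healing actions. Here is a minimal sketch of that idea. The action names (`circuit_breaker`, `scale_out`, `traffic_shift`) come from this README, but the `Policy` type, the dictionary keys, and the trigger conditions are assumptions picked so the sketch reproduces the outcomes described here; the framework's real policies may differ.

```python
from typing import Callable, NamedTuple


class Policy(NamedTuple):
    action: str
    condition: Callable[[dict], bool]


# Hypothetical rules, evaluated in order; every matching policy contributes an action.
POLICIES = [
    Policy("circuit_breaker", lambda m: m["error_rate"] >= 0.15),
    Policy("scale_out", lambda m: m["cpu"] > 0.9 or m["memory"] > 0.9),
    Policy("traffic_shift", lambda m: m["latency_ms"] > 300 and m["error_rate"] < 0.15),
]


def recommend_actions(metrics: dict) -> list[str]:
    """Return the actions whose conditions match the given metrics."""
    return [policy.action for policy in POLICIES if policy.condition(metrics)]


# Test 1 (resource exhaustion); latency and error rate are assumed to be normal.
exhaustion = recommend_actions(
    {"latency_ms": 120, "error_rate": 0.02, "cpu": 0.95, "memory": 0.95}
)
# Test 2 (high latency + errors); CPU and memory are assumed to be normal.
overload = recommend_actions(
    {"latency_ms": 500, "error_rate": 0.15, "cpu": 0.50, "memory": 0.50}
)
```

Keeping policies as data rather than nested conditionals makes it easy to add or reorder rules without touching the evaluation loop.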
## 🚨 Default Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |
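
The table above can be read as a per-metric lookup. The sketch below classifies each metric independently as `ok`, `warning`, or `critical`; the dictionary keys and function names are hypothetical, and the sketch does not attempt to reproduce how the framework combines metrics into an overall severity.

```python
# (warning, critical) limits copied from the threshold table.
# A value must strictly exceed a limit to trip it.
THRESHOLDS = {
    "latency_p99_ms": (150, 300),
    "error_rate": (0.05, 0.15),
    "cpu": (0.8, 0.9),
    "memory": (0.8, 0.9),
}


def classify(metric: str, value: float) -> str:
    """Classify one metric value against its static thresholds."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_all(metrics: dict) -> dict:
    """Classify every metric in a telemetry snapshot."""
    return {name: classify(name, value) for name, value in metrics.items()}


levels = classify_all(
    {"latency_p99_ms": 800, "error_rate": 0.25, "cpu": 0.95, "memory": 0.90}
)
```

For the Critical Failure example, latency, error rate, and CPU classify as `critical`, while memory at exactly 0.90 classifies as `warning` because the limits are strict inequalities.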
## 🔮 Roadmap
- [ ] Predictive anomaly detection
- [ ] Multi-cloud coordination
- [ ] Advanced root cause analysis
- [ ] Automated runbook execution
- [ ] Team learning and knowledge transfer
## 💡 Why This Matters
> "The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.
## 🛠️ Technical Stack
- **Backend**: Python, FastAPI, Sentence Transformers
- **AI/ML**: FAISS, Hugging Face, Custom Agents
- **Frontend**: Gradio
- **Storage**: FAISS vector database, JSON metadata
---
**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**
*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*