---
title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---
# 🧠 Agentic Reliability Framework
**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**
## 🚀 Live Demo
**Try it now!** Enter system telemetry data and watch specialized AI agents analyze it, diagnose issues, and recommend healing actions in real time.
## 🎯 What It Does
This framework transforms traditional monitoring into **autonomous reliability engineering**:
- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- **💰 Business Impact**: Real-time revenue and user impact calculations
- **📚 Learning System**: FAISS-powered memory learns from every incident
- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features
## 🛠️ Quick Start
### 1. Select a Service
Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`
### 2. Adjust Metrics
- **Latency P99**: Alert threshold >150ms (adaptive)
- **Error Rate**: Alert threshold >0.05 (5%)
- **Throughput**: Current requests per second
- **CPU/Memory**: Utilization (0.0-1.0 scale)
### 3. Submit & Analyze
Click **"Submit Telemetry Event"** to see AI agents in action!
## 📊 Example Test Cases
### 🚨 Critical Failure
- Component: `api-service`
- Latency: 800ms
- Error Rate: 0.25
- CPU: 0.95
- Memory: 0.90

*Expected: CRITICAL severity, circuit_breaker + scale_out actions*
### ⚠️ Performance Issue
- Component: `auth-service`
- Latency: 350ms
- Error Rate: 0.08
- CPU: 0.75
- Memory: 0.65

*Expected: HIGH severity, traffic_shift action*
### ✅ Normal Operation
- Component: `payment-service`
- Latency: 120ms
- Error Rate: 0.02
- CPU: 0.45
- Memory: 0.35

*Expected: NORMAL status, no actions needed*
## 🔧 Technical Features
### Multi-Agent Architecture
- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- **🔍 Diagnostician Agent**: Root cause analysis & investigation
- **🤖 Orchestration Manager**: Coordinates all agents in parallel
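The coordination pattern described above can be sketched with `asyncio.gather`. This is a minimal illustration, not the framework's actual code: the function names (`detective_agent`, `diagnostician_agent`, `orchestrate`), the event keys, and the toy decision rules are all assumptions made for the example. Note that `asyncio` provides concurrency on one thread, which suits I/O-bound agents; CPU-bound agents would need a different approach.

```python
import asyncio


async def detective_agent(event: dict) -> dict:
    """Hypothetical stand-in: flags an anomaly when the error rate exceeds 0.05."""
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    return {"agent": "detective", "anomaly": event["error_rate"] > 0.05}


async def diagnostician_agent(event: dict) -> dict:
    """Hypothetical stand-in: guesses a cause from CPU utilization alone."""
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    cause = "resource_exhaustion" if event["cpu"] > 0.9 else "unknown"
    return {"agent": "diagnostician", "suspected_cause": cause}


async def orchestrate(event: dict) -> list:
    """Run every agent concurrently; results come back in call order."""
    return await asyncio.gather(
        detective_agent(event),
        diagnostician_agent(event),
    )
```

Calling `asyncio.run(orchestrate({"error_rate": 0.25, "cpu": 0.95}))` returns one finding per agent, in the order the agents were passed to `gather`.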
### Smart Detection
- Adaptive thresholds that learn from your environment
- Multi-dimensional anomaly scoring (0-100% confidence)
- Correlation analysis across metrics
- FAISS vector memory for incident similarity
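To illustrate the idea behind incident similarity, here is a small sketch that ranks stored incidents by cosine similarity to a new one. It deliberately uses a brute-force, pure-Python search as a stand-in for FAISS, and raw metric vectors as a stand-in for learned embeddings. The `IncidentMemory` class, its methods, and the vector layout are hypothetical and chosen only for this example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class IncidentMemory:
    """Brute-force stand-in for a vector index; this is not FAISS itself."""

    def __init__(self) -> None:
        self._incidents = []  # list of (label, vector) pairs

    def add(self, label: str, vector: list) -> None:
        self._incidents.append((label, vector))

    def most_similar(self, query: list, k: int = 1) -> list:
        """Return the k stored incidents closest to the query, best first."""
        scored = [
            (label, cosine_similarity(query, vector))
            for label, vector in self._incidents
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]
```

A linear scan like this is fine for a handful of incidents. A dedicated vector index becomes worthwhile once the incident history grows large enough that scanning every entry is too slow.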
### Business Intelligence
- Real-time revenue impact calculations
- User impact estimation
- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
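One plausible way to estimate business impact is to count failed requests over the incident window and scale by per-request revenue. The sketch below assumes exactly that model. The function name `estimate_impact` and the parameters `revenue_per_request` and `requests_per_user` are hypothetical inputs, not values or formulas taken from the framework.

```python
def estimate_impact(
    throughput_rps: float,
    error_rate: float,
    duration_s: float,
    revenue_per_request: float,
    requests_per_user: float,
) -> dict:
    """Rough impact estimate under a simple 'failed requests' model (an assumption)."""
    # Requests that failed during the incident window.
    failed_requests = throughput_rps * error_rate * duration_s
    return {
        "failed_requests": failed_requests,
        # Revenue tied to the failed requests, assuming a flat value per request.
        "revenue_at_risk": failed_requests * revenue_per_request,
        # Users touched, assuming each user issues the same number of requests.
        "users_affected": failed_requests / requests_per_user,
    }
```

The model ignores retries and uneven traffic, so its output is better read as an order-of-magnitude figure than as an exact loss.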
## 🎮 Try These Scenarios
### Test 1: Resource Exhaustion
Set CPU to 0.95 and Memory to 0.95 - watch scale_out actions trigger
### Test 2: High Latency + Errors
Set Latency to 500ms and Error Rate to 0.15 - see circuit breaker activation
### Test 3: Gradual Degradation
Start with normal values and slowly increase latency and error rate to watch the adaptive thresholds respond
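For intuition about how an adaptive threshold might behave in Test 3, the sketch below derives the alert level from a rolling window of recent readings (mean plus `k` standard deviations) and falls back to a fixed default until enough data has arrived. The `AdaptiveThreshold` class, the window size, and the choice of `k = 3` are assumptions for illustration; the framework's own adaptation rule may differ.

```python
from collections import deque
import statistics


class AdaptiveThreshold:
    """Rolling mean + k * standard deviation, with a fixed default at startup."""

    def __init__(self, window: int = 20, k: float = 3.0, default: float = 150.0) -> None:
        self._values = deque(maxlen=window)  # most recent readings only
        self._k = k
        self._default = default

    def current(self) -> float:
        """Threshold implied by the readings seen so far."""
        if len(self._values) < 2:
            return self._default  # stdev needs at least two samples
        mean = statistics.fmean(self._values)
        spread = statistics.stdev(self._values)
        return mean + self._k * spread

    def observe(self, value: float) -> bool:
        """Return True if the value breaches the threshold, then record it."""
        breached = value > self.current()
        self._values.append(value)
        return breached
```

One trade-off of this design is that a slow enough drift raises the baseline along with the readings, so a purely rolling threshold can miss gradual degradation. Pairing it with fixed upper limits is a common safeguard.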
## 🚨 Default Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |
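
The table above can be read as a simple lookup: a reading strictly above the critical limit is critical, strictly above the warning limit is a warning, and anything else is normal. The sketch below encodes that reading. The threshold values come from the table, while the metric key names and the function names are hypothetical.

```python
# Default thresholds from the table above, as (warning, critical) pairs.
# The key names are illustrative and not taken from the framework.
DEFAULT_THRESHOLDS = {
    "latency_p99_ms": (150.0, 300.0),
    "error_rate": (0.05, 0.15),
    "cpu_util": (0.8, 0.9),
    "memory_util": (0.8, 0.9),
}


def classify_metric(name: str, value: float) -> str:
    """Return 'normal', 'warning', or 'critical' for a single reading."""
    warning, critical = DEFAULT_THRESHOLDS[name]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "normal"


def classify_event(metrics: dict) -> dict:
    """Classify every metric in a telemetry event independently."""
    return {name: classify_metric(name, value) for name, value in metrics.items()}
```

Because the comparisons are strict, a memory reading of exactly 0.90 (as in the Critical Failure test case) classifies as a warning under this sketch, not as critical.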
## 🔮 Roadmap
- [ ] Predictive anomaly detection
- [ ] Multi-cloud coordination
- [ ] Advanced root cause analysis
- [ ] Automated runbook execution
- [ ] Team learning and knowledge transfer
## 💡 Why This Matters
> "The most reliable system is the one that fixes itself before anyone notices there was a problem."
This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.
## 🛠️ Technical Stack
- **Backend**: Python, FastAPI, Sentence Transformers
- **AI/ML**: FAISS, Hugging Face, Custom Agents
- **Frontend**: Gradio
- **Storage**: FAISS vector database, JSON metadata
---
**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**
*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*