Space: A-R-F/Agentic-Reliability-Framework-API

petter2025 committed on Nov 24, 2025

Commit 540525a (verified) · 1 parent: 8a5d251

Update README.md

Files changed (1): README.md +130 -1

@@ -9,4 +9,133 @@ app_file: app.py
 pinned: false
 license: mit
 short_description: AI-powered reliability with multi-agent anomaly detection
----

+---
+# 🧠 Agentic Reliability Framework
+**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**
+## 🚀 Live Demo
+**Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real time.
+## 🎯 What It Does
+This framework transforms traditional monitoring into **autonomous reliability engineering**:
+- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
+- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
+- **💰 Business Impact**: Real-time revenue and user impact calculations
+- **📚 Learning System**: FAISS-powered memory learns from every incident
+- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features
+## 🛠️ Quick Start
+### 1. Select a Service
+Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`
+### 2. Adjust Metrics
+- **Latency P99**: Alert threshold >150ms (adaptive)
+- **Error Rate**: Alert threshold >0.05 (5%)
+- **Throughput**: Current requests per second
+- **CPU/Memory**: Utilization (0.0-1.0 scale)
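
The inputs listed above could be modeled as a small validated record before submission. This is a minimal sketch, not the Space's actual schema: the class name `TelemetryEvent`, its field names, and the validation rules are assumptions made for illustration. Only the service names and the 0.0-1.0 utilization scale come from the Quick Start text.

```python
from dataclasses import dataclass

# Service names taken from the Quick Start list.
SERVICES = {"api-service", "auth-service", "payment-service", "database", "cache-service"}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one telemetry submission."""
    component: str
    latency_p99_ms: float
    error_rate: float       # fraction, 0.0-1.0
    throughput_rps: float   # requests per second
    cpu_util: float         # fraction, 0.0-1.0
    memory_util: float      # fraction, 0.0-1.0

    def __post_init__(self) -> None:
        if self.component not in SERVICES:
            raise ValueError(f"unknown service: {self.component!r}")
        # Fractions must stay on the 0.0-1.0 scale.
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")
```

Rejecting out-of-range values early keeps a mistyped percentage (for example `95` instead of `0.95`) from being scored as a real reading.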
+### 3. Submit & Analyze
+Click **"Submit Telemetry Event"** to see AI agents in action!
+## 📊 Example Test Cases
+### 🚨 Critical Failure
+- Component: api-service
+- Latency: 800ms
+- Error Rate: 0.25
+- CPU: 0.95
+- Memory: 0.90
+
+*Expected: CRITICAL severity, `circuit_breaker` + `scale_out` actions*
+### ⚠️ Performance Issue
+- Component: auth-service
+- Latency: 350ms
+- Error Rate: 0.08
+- CPU: 0.75
+- Memory: 0.65
+
+*Expected: HIGH severity, `traffic_shift` action*
+### ✅ Normal Operation
+- Component: payment-service
+- Latency: 120ms
+- Error Rate: 0.02
+- CPU: 0.45
+- Memory: 0.35
+
+*Expected: NORMAL status, no actions needed*
+## 🔧 Technical Features
+### Multi-Agent Architecture
+- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
+- **🔍 Diagnostician Agent**: Root cause analysis & investigation
+- **🤖 Orchestration Manager**: Coordinates all agents in parallel
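
One way an orchestrator might fan a single event out to several agents is with `asyncio.gather`. The sketch below is an illustration under stated assumptions, not the framework's code: the function names, the event keys, and the placeholder rules inside each agent are all hypothetical. Note that `asyncio` runs the coroutines concurrently on one thread rather than in true parallel.

```python
import asyncio


async def detective_agent(event: dict) -> dict:
    """Hypothetical anomaly check; the 0.05 cutoff mirrors the README's error-rate threshold."""
    await asyncio.sleep(0)  # stand-in for real I/O or model inference
    return {"agent": "detective", "anomaly": event["error_rate"] > 0.05}


async def diagnostician_agent(event: dict) -> dict:
    """Hypothetical root-cause guess based on a single placeholder rule."""
    await asyncio.sleep(0)
    cause = "resource_exhaustion" if event["cpu"] > 0.9 else "unknown"
    return {"agent": "diagnostician", "suspected_cause": cause}


async def orchestrate(event: dict) -> dict:
    # gather() schedules both agents at once and preserves argument order.
    results = await asyncio.gather(detective_agent(event), diagnostician_agent(event))
    return {result["agent"]: result for result in results}


def run_orchestration(event: dict) -> dict:
    """Synchronous entry point for callers that are not already async."""
    return asyncio.run(orchestrate(event))
```

Calling `run_orchestration({"error_rate": 0.25, "cpu": 0.95})` returns one result per agent, keyed by agent name.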
+### Smart Detection
+- Adaptive thresholds that learn from your environment
+- Multi-dimensional anomaly scoring (0-100% confidence)
+- Correlation analysis across metrics
+- FAISS vector memory for incident similarity
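
The README does not say how its adaptive thresholds are computed. A common approach, shown here purely as an assumed stand-in, is a rolling baseline of mean plus `k` standard deviations that falls back to a fixed default until enough samples arrive. The class name `AdaptiveThreshold` and all parameters are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Rolling mean + k*stddev threshold (an assumed technique, not the framework's)."""

    def __init__(self, initial: float = 150.0, window: int = 50,
                 k: float = 3.0, min_samples: int = 5) -> None:
        self.initial = initial            # static fallback, e.g. 150 ms latency
        self.k = k
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)  # only the most recent `window` values

    def current(self) -> float:
        # Until the baseline is trustworthy, use the static default.
        if len(self.samples) < self.min_samples:
            return self.initial
        return mean(self.samples) + self.k * pstdev(self.samples)

    def update(self, value: float) -> bool:
        """Return True if `value` breaches the threshold; learn from it otherwise."""
        breached = value > self.current()
        if not breached:
            # Anomalous readings are kept out of the baseline so a spike
            # does not raise the threshold it just violated.
            self.samples.append(value)
        return breached
```

Excluding breaches from the history is a design choice: it keeps the baseline stable during an incident, at the cost of adapting more slowly to a legitimate permanent shift.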
+### Business Intelligence
+- Real-time revenue impact calculations
+- User impact estimation
+- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
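
The README does not give its impact formulas. As a rough illustration only, failed requests can be estimated from throughput, error rate, and incident duration, and revenue impact from an assumed average value per request. Every name and parameter below is hypothetical, and the per-request revenue figure must be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactEstimate:
    failed_requests: float
    revenue_lost: float


def estimate_impact(throughput_rps: float, error_rate: float,
                    duration_s: float, revenue_per_request: float) -> ImpactEstimate:
    """Back-of-the-envelope impact estimate (assumed formula, not the framework's)."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0.0, 1.0]")
    # Requests that failed during the incident window.
    failed = throughput_rps * error_rate * duration_s
    # Each failed request is assumed to forfeit the same average revenue.
    return ImpactEstimate(failed_requests=failed,
                          revenue_lost=failed * revenue_per_request)
```

This treats every failed request as lost revenue, which overstates impact when clients retry successfully; a production model would likely account for retries and per-service value.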
+## 🎮 Try These Scenarios
+### Test 1: Resource Exhaustion
+Set CPU to 0.95 and Memory to 0.95, then watch `scale_out` actions trigger.
+### Test 2: High Latency + Errors
+Set Latency to 500ms and Error Rate to 0.15 to see circuit breaker activation.
+### Test 3: Gradual Degradation
+Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds respond.
+## 🚨 Default Alert Thresholds
+| Metric | Warning | Critical |
+|--------|---------|----------|
+| Latency P99 | >150ms | >300ms |
+| Error Rate | >0.05 | >0.15 |
+| CPU Utilization | >0.8 | >0.9 |
+| Memory Utilization | >0.8 | >0.9 |
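
The table above can be read as a per-metric lookup. The sketch below encodes those default values and classifies a single reading; the metric keys and function names are assumptions, and how the framework combines per-metric levels into an overall LOW/MEDIUM/HIGH/CRITICAL severity is not specified in the README, so it is not modeled here.

```python
# (warning, critical) pairs copied from the default threshold table.
THRESHOLDS = {
    "latency_p99_ms": (150.0, 300.0),
    "error_rate": (0.05, 0.15),
    "cpu_util": (0.8, 0.9),
    "memory_util": (0.8, 0.9),
}


def classify_metric(name: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for one metric reading."""
    warning, critical = THRESHOLDS[name]
    # The table uses strict '>' comparisons, so a value equal to a
    # threshold stays in the lower band.
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_event(metrics: dict) -> dict:
    """Classify every known metric in an event; unknown keys are ignored."""
    return {name: classify_metric(name, value)
            for name, value in metrics.items() if name in THRESHOLDS}
```

Because the comparisons are strict, an error rate of exactly 0.15 is classified as a warning rather than critical under this reading of the table.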
+## 🔮 Roadmap
+- [ ] Predictive anomaly detection
+- [ ] Multi-cloud coordination
+- [ ] Advanced root cause analysis
+- [ ] Automated runbook execution
+- [ ] Team learning and knowledge transfer
+## 💡 Why This Matters
+> "The most reliable system is the one that fixes itself before anyone notices there was a problem."
+This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.
+## 🛠️ Technical Stack
+- **Backend**: Python, FastAPI, Sentence Transformers
+- **AI/ML**: FAISS, Hugging Face, Custom Agents
+- **Frontend**: Gradio
+- **Storage**: FAISS vector database, JSON metadata
+---
+**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**
+*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*