---
title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---

# 🧠 Agentic Reliability Framework

**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**

## Live Demo

**Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real time.

## 🎯 What It Does

This framework transforms traditional monitoring into **autonomous reliability engineering**:

- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- **💰 Business Impact**: Real-time revenue and user impact calculations
- **Learning System**: FAISS-powered memory learns from every incident
- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features

## 🛠️ Quick Start

### 1. Select a Service
Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`

### 2. Adjust Metrics
- **Latency P99**: Alert threshold >150ms (adaptive)
- **Error Rate**: Alert threshold >0.05 (5%)
- **Throughput**: Current requests per second
- **CPU/Memory**: Utilization (0.0-1.0 scale)

### 3. Submit & Analyze
Click **"Submit Telemetry Event"** to see AI agents in action!

## Example Test Cases

### 🚨 Critical Failure

- Component: `api-service`
- Latency: 800ms
- Error Rate: 0.25
- CPU: 0.95
- Memory: 0.90

*Expected: CRITICAL severity, `circuit_breaker` and `scale_out` actions*

### ⚠️ Performance Issue

- Component: `auth-service`
- Latency: 350ms
- Error Rate: 0.08
- CPU: 0.75
- Memory: 0.65

*Expected: HIGH severity, `traffic_shift` action*

### ✅ Normal Operation

- Component: `payment-service`
- Latency: 120ms
- Error Rate: 0.02
- CPU: 0.45
- Memory: 0.35

*Expected: NORMAL status, no actions needed*

## 🔧 Technical Features

### Multi-Agent Architecture
- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- **Diagnostician Agent**: Root cause analysis & investigation
- **🤖 Orchestration Manager**: Coordinates all agents in parallel
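
The coordination pattern described above can be sketched with `asyncio`. This is a minimal illustration, not the framework's actual code: `TelemetryEvent`, `detective_agent`, `diagnostician_agent`, and `orchestrate` are hypothetical names, and the agent bodies are placeholders that only reuse the warning thresholds listed in this README. Note that `asyncio.gather` runs the agents concurrently on one event loop, which is an assumption about what "in parallel" means here.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TelemetryEvent:
    """Hypothetical event shape; fields mirror the Quick Start metrics."""
    component: str
    latency_p99_ms: float
    error_rate: float
    cpu_util: float
    memory_util: float


async def detective_agent(event: TelemetryEvent) -> dict:
    """Placeholder anomaly check using the documented warning thresholds."""
    anomalous = event.latency_p99_ms > 150 or event.error_rate > 0.05
    return {"agent": "detective", "anomaly": anomalous}


async def diagnostician_agent(event: TelemetryEvent) -> dict:
    """Placeholder root-cause guess based on resource saturation."""
    if max(event.cpu_util, event.memory_util) > 0.8:
        cause = "resource_exhaustion"
    else:
        cause = "unknown"
    return {"agent": "diagnostician", "suspected_cause": cause}


async def orchestrate(event: TelemetryEvent) -> list:
    """Run every agent concurrently; results come back in call order."""
    return await asyncio.gather(
        detective_agent(event),
        diagnostician_agent(event),
    )


event = TelemetryEvent("api-service", 800, 0.25, 0.95, 0.90)
results = asyncio.run(orchestrate(event))
print(results)
```

Keeping each agent as an independent coroutine means a new agent can be added to the `gather` call without touching the others.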

### Smart Detection
- Adaptive thresholds that learn from your environment
- Multi-dimensional anomaly scoring (0-100% confidence)
- Correlation analysis across metrics
- FAISS vector memory for incident similarity
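
One common way to build an adaptive threshold is a rolling baseline of mean plus `k` standard deviations. The sketch below assumes that approach; the README does not say which method the framework uses, and the `AdaptiveThreshold` class, its window size, `k`, and the minimum of five samples are all illustrative choices. Only the 150ms fallback comes from the documented default.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Flag values above mean + k * stddev of recent samples (hypothetical)."""

    def __init__(self, window: int = 50, k: float = 3.0, floor: float = 150.0):
        self.samples = deque(maxlen=window)
        self.k = k
        self.floor = floor  # static fallback until enough history exists

    def current(self) -> float:
        """Return the threshold implied by the samples seen so far."""
        if len(self.samples) < 5:
            return self.floor
        return mean(self.samples) + self.k * pstdev(self.samples)

    def observe(self, value: float) -> bool:
        """Return True if value breaches the threshold, then record it."""
        breached = value > self.current()
        self.samples.append(value)
        return breached


threshold = AdaptiveThreshold()
for latency_ms in [100, 105, 98, 102, 101, 99, 103]:
    threshold.observe(latency_ms)

print(threshold.observe(104))  # close to the learned baseline
print(threshold.observe(400))  # far above the learned baseline
```

This version records every sample, including breaches, so a sustained incident gradually raises the baseline. Excluding breached samples from the window is a reasonable variation.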

### Business Intelligence
- Real-time revenue impact calculations
- User impact estimation
- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
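
As a rough illustration of an impact calculation, the sketch below multiplies failed requests by an assumed revenue per request. The formula, the `estimate_impact` function, and the `revenue_per_request` and `requests_per_user` parameters are all assumptions made for this example; the README does not describe how the framework computes these figures.

```python
def estimate_impact(
    throughput_rps: float,
    error_rate: float,
    duration_s: float,
    revenue_per_request: float,
    requests_per_user: float = 10.0,
) -> dict:
    """Estimate failed requests, revenue at risk, and affected users.

    Hypothetical model: every failed request loses a fixed amount of
    revenue, and each user sends a fixed number of requests.
    """
    failed_requests = throughput_rps * error_rate * duration_s
    return {
        "failed_requests": failed_requests,
        "revenue_at_risk": failed_requests * revenue_per_request,
        "affected_users": failed_requests / requests_per_user,
    }


# One minute of the "Critical Failure" error rate at an assumed 1000 req/s.
impact = estimate_impact(
    throughput_rps=1000,
    error_rate=0.25,
    duration_s=60,
    revenue_per_request=0.02,
)
print(impact)
```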

## 🎮 Try These Scenarios

### Test 1: Resource Exhaustion
Set CPU to 0.95 and Memory to 0.95, then watch the `scale_out` actions trigger.

### Test 2: High Latency + Errors
Set Latency to 500ms and Error Rate to 0.15 to see the circuit breaker activate.

### Test 3: Gradual Degradation
Start with normal values and slowly increase latency and error rate to see the adaptive thresholds respond.
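
For readers unfamiliar with the circuit breaker mentioned in Test 2, here is a minimal sketch of the general pattern. It is not the framework's implementation: the `CircuitBreaker` class, its failure threshold, and the reset timeout are assumed for illustration.

```python
import time


class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp of when the breaker opened

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            return "half_open"  # timeout elapsed, allow a trial request
        return "open"

    def allow_request(self) -> bool:
        """Requests are blocked only while the breaker is fully open."""
        return self.state != "open"

    def record_success(self) -> None:
        """A success closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Open (or re-open) the breaker once failures reach the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
for _ in range(3):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

A failure during the half-open trial re-opens the breaker, because the failure count is still at or above the threshold.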

## 🚨 Default Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |
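
The table above maps directly onto a small lookup. The sketch below labels each metric as `ok`, `warning`, or `critical` using the table's strict greater-than comparisons. The metric key names and the `classify` function are hypothetical, and the per-metric labels are not the framework's overall LOW/MEDIUM/HIGH/CRITICAL severity, which the README does not define in terms of these thresholds.

```python
# (warning, critical) limits copied from the table above; key names are assumed.
DEFAULT_THRESHOLDS = {
    "latency_p99_ms": (150, 300),
    "error_rate": (0.05, 0.15),
    "cpu_util": (0.8, 0.9),
    "memory_util": (0.8, 0.9),
}


def classify(metrics: dict) -> dict:
    """Label each metric by comparing it with its warning and critical limits."""
    levels = {}
    for name, value in metrics.items():
        warning, critical = DEFAULT_THRESHOLDS[name]
        if value > critical:
            levels[name] = "critical"
        elif value > warning:
            levels[name] = "warning"
        else:
            levels[name] = "ok"
    return levels


# Values from the "Critical Failure" example test case.
critical_case = classify({
    "latency_p99_ms": 800,
    "error_rate": 0.25,
    "cpu_util": 0.95,
    "memory_util": 0.90,
})
print(critical_case)
```

Because the comparisons are strict, a memory utilization of exactly 0.90 is labeled `warning` rather than `critical`.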

## 🔮 Roadmap

- [ ] Predictive anomaly detection
- [ ] Multi-cloud coordination
- [ ] Advanced root cause analysis
- [ ] Automated runbook execution
- [ ] Team learning and knowledge transfer

## 💡 Why This Matters

> "The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.

## 🛠️ Technical Stack

- **Backend**: Python, FastAPI, Sentence Transformers
- **AI/ML**: FAISS, Hugging Face, Custom Agents
- **Frontend**: Gradio
- **Storage**: FAISS vector database, JSON metadata

---

**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**

*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*