Spaces:

A-R-F
/

Agentic-Reliability-Framework-API

Running

File size: 4,067 Bytes

---
title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---

# 🧠 Agentic Reliability Framework

**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**

## 🚀 Live Demo

**Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real-time.

## 🎯 What It Does

This framework transforms traditional monitoring into **autonomous reliability engineering**:

- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- **💰 Business Impact**: Real-time revenue and user impact calculations
- **📚 Learning System**: FAISS-powered memory learns from every incident
- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features

## 🛠️ Quick Start

### 1. Select a Service
Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`

### 2. Adjust Metrics
- **Latency P99**: Alert threshold >150ms (adaptive)
- **Error Rate**: Alert threshold >0.05 (5%)
- **Throughput**: Current requests per second
- **CPU/Memory**: Utilization (0.0-1.0 scale)

### 3. Submit & Analyze
Click **"Submit Telemetry Event"** to see AI agents in action!

## 📊 Example Test Cases

### 🚨 Critical Failure
Component: api-service
Latency: 800ms
Error Rate: 0.25
CPU: 0.95
Memory: 0.90

text
*Expected: CRITICAL severity, circuit_breaker + scale_out actions*

### ⚠️ Performance Issue
Component: auth-service
Latency: 350ms
Error Rate: 0.08
CPU: 0.75
Memory: 0.65

text
*Expected: HIGH severity, traffic_shift action*

### ✅ Normal Operation
Component: payment-service
Latency: 120ms
Error Rate: 0.02
CPU: 0.45
Memory: 0.35

text
*Expected: NORMAL status, no actions needed*

## 🔧 Technical Features

### Multi-Agent Architecture
- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- **🔍 Diagnostician Agent**: Root cause analysis & investigation
- **🤖 Orchestration Manager**: Coordinates all agents in parallel

### Smart Detection
- Adaptive thresholds that learn from your environment
- Multi-dimensional anomaly scoring (0-100% confidence)
- Correlation analysis across metrics
- FAISS vector memory for incident similarity

### Business Intelligence
- Real-time revenue impact calculations
- User impact estimation  
- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)

## 🎮 Try These Scenarios

### Test 1: Resource Exhaustion
Set CPU to 0.95 and Memory to 0.95 - watch scale_out actions trigger

### Test 2: High Latency + Errors  
Set Latency to 500ms and Error Rate to 0.15 - see circuit breaker activation

### Test 3: Gradual Degradation
Start with normal values and slowly increase latency/errors to see adaptive thresholds

## 🚨 Default Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |

## 🔮 Roadmap

- [ ] Predictive anomaly detection
- [ ] Multi-cloud coordination  
- [ ] Advanced root cause analysis
- [ ] Automated runbook execution
- [ ] Team learning and knowledge transfer

## 💡 Why This Matters

> "The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.

## 🛠️ Technical Stack

- **Backend**: Python, FastAPI, Sentence Transformers
- **AI/ML**: FAISS, Hugging Face, Custom Agents
- **Frontend**: Gradio
- **Storage**: FAISS vector database, JSON metadata

---

**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**

*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*