title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection

🧠 Agentic Reliability Framework

AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing

🚀 Live Demo

Try it now! Enter system telemetry data and watch specialized AI agents analyze it, diagnose issues, and recommend healing actions in real time.

🎯 What It Does

This framework transforms traditional monitoring into autonomous reliability engineering:

🤖 Multi-Agent AI Analysis: Specialized agents work together to detect and diagnose issues
🔧 Automated Healing: Policy-based auto-remediation for common failures
💰 Business Impact: Real-time revenue and user impact calculations
📚 Learning System: FAISS-powered memory learns from every incident
⚡ Production Ready: Circuit breakers, adaptive thresholds, enterprise features

🛠️ Quick Start

1. Select a Service

Choose from: api-service, auth-service, payment-service, database, cache-service

2. Adjust Metrics

Latency P99: Alert threshold >150ms (adaptive)
Error Rate: Alert threshold >0.05 (5%)
Throughput: Current requests per second
CPU/Memory: Utilization (0.0-1.0 scale)
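As a rough illustration of the inputs listed above, one telemetry submission could be modeled as below. This is a minimal sketch, not the app's actual schema: the `TelemetryEvent` class, its field names, and the validation rules are assumptions; only the service names and the 0.0-1.0 scale come from this README.

```python
from dataclasses import dataclass

# Service names listed in step 1 of the Quick Start.
KNOWN_SERVICES = {"api-service", "auth-service", "payment-service", "database", "cache-service"}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one telemetry submission (not the app's real schema)."""
    component: str
    latency_p99_ms: float
    error_rate: float      # fraction, 0.0-1.0
    throughput_rps: float  # requests per second
    cpu_util: float        # fraction, 0.0-1.0
    memory_util: float     # fraction, 0.0-1.0

    def __post_init__(self):
        # Reject values that fall outside the ranges described in the Quick Start.
        if self.component not in KNOWN_SERVICES:
            raise ValueError(f"unknown component: {self.component!r}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


# The throughput value here is illustrative only.
event = TelemetryEvent("api-service", 800.0, 0.25, 1000.0, 0.95, 0.90)
```

Validating at construction time keeps out-of-range values (for example an error rate entered as 25 instead of 0.25) from reaching any downstream analysis.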

3. Submit & Analyze

Click "Submit Telemetry Event" to see AI agents in action!

📊 Example Test Cases

🚨 Critical Failure

Component: api-service
Latency: 800ms
Error Rate: 0.25
CPU: 0.95
Memory: 0.90

Expected: CRITICAL severity, circuit_breaker + scale_out actions

⚠️ Performance Issue

Component: auth-service
Latency: 350ms
Error Rate: 0.08
CPU: 0.75
Memory: 0.65

Expected: HIGH severity, traffic_shift action

✅ Normal Operation

Component: payment-service
Latency: 120ms
Error Rate: 0.02
CPU: 0.45
Memory: 0.35

Expected: NORMAL status, no actions needed
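To make the idea of policy-based remediation concrete, here is a small sketch of a rule table that happens to reproduce the expected actions for the three test cases above. The function `recommend_actions` and its rules are hypothetical; the framework's real policies may differ. The numeric limits are borrowed from the default alert thresholds documented in this README, and only the action names (`circuit_breaker`, `traffic_shift`, `scale_out`) come from the examples.

```python
def recommend_actions(latency_ms, error_rate, cpu, memory):
    """Return healing actions for one reading, using illustrative rules only."""
    actions = []
    # Assumed rule: severe latency plus a critical error rate trips the breaker.
    if latency_ms > 300 and error_rate > 0.15:
        actions.append("circuit_breaker")
    # Assumed rule: severe latency with a moderately elevated error rate shifts traffic.
    elif latency_ms > 300 and error_rate > 0.05:
        actions.append("traffic_shift")
    # Assumed rule: CPU or memory saturation adds capacity.
    if cpu > 0.9 or memory > 0.9:
        actions.append("scale_out")
    return actions


print(recommend_actions(800, 0.25, 0.95, 0.90))  # ['circuit_breaker', 'scale_out']
print(recommend_actions(350, 0.08, 0.75, 0.65))  # ['traffic_shift']
print(recommend_actions(120, 0.02, 0.45, 0.35))  # []
```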

🔧 Technical Features

Multi-Agent Architecture

🕵️ Detective Agent: Anomaly detection & pattern recognition
🔍 Diagnostician Agent: Root cause analysis & investigation
🤖 Orchestration Manager: Coordinates all agents in parallel
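One way to picture the coordination described above is with `asyncio.gather`, which runs several coroutines concurrently and returns their results in order. This is a sketch under stated assumptions: the agent functions, their return shapes, and the hard-coded heuristics are placeholders, not the framework's actual agents.

```python
import asyncio


async def detective_agent(event):
    """Stand-in for anomaly detection: flags metrics above fixed limits."""
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    limits = (("latency_ms", 150), ("error_rate", 0.05))
    flagged = [name for name, limit in limits if event[name] > limit]
    return {"agent": "detective", "anomalous_metrics": flagged}


async def diagnostician_agent(event):
    """Stand-in for root cause analysis: a single hard-coded heuristic."""
    await asyncio.sleep(0)
    cause = "resource saturation" if event["cpu"] > 0.9 else "undetermined"
    return {"agent": "diagnostician", "likely_cause": cause}


async def orchestrate(event):
    """Run both agents concurrently and collect their results in call order."""
    return await asyncio.gather(detective_agent(event), diagnostician_agent(event))


event = {"latency_ms": 800, "error_rate": 0.25, "cpu": 0.95}
results = asyncio.run(orchestrate(event))
```

Note that `asyncio` gives concurrency on a single thread; CPU-heavy agents would need a process pool or similar to run truly in parallel.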

Smart Detection

Adaptive thresholds that learn from your environment
Multi-dimensional anomaly scoring (0-100% confidence)
Correlation analysis across metrics
FAISS vector memory for incident similarity
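The incident-similarity idea can be sketched without any dependencies. The example below is a brute-force stand-in, not FAISS: it stores small metric vectors and ranks them by cosine similarity. The `IncidentMemory` class, the vector layout, and the incident notes are all invented for illustration; the real system presumably indexes embeddings rather than raw metrics.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IncidentMemory:
    """Brute-force stand-in for a vector index; stores (vector, note) pairs."""

    def __init__(self):
        self._incidents = []

    def add(self, vector, note):
        self._incidents.append((list(vector), note))

    def most_similar(self, vector, k=1):
        """Return the k stored incidents closest to `vector` as (score, note) pairs."""
        scored = [(cosine_similarity(vector, v), note) for v, note in self._incidents]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


memory = IncidentMemory()
# Vectors are [latency_s, error_rate, cpu, memory]; the notes are made up.
memory.add([0.80, 0.25, 0.95, 0.90], "overload: scaled out")
memory.add([0.12, 0.02, 0.45, 0.35], "normal baseline")
best = memory.most_similar([0.75, 0.20, 0.90, 0.92])
```

A linear scan like this is fine for a handful of incidents; a dedicated vector index becomes worthwhile once the history grows large.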

Business Intelligence

Real-time revenue impact calculations
User impact estimation
Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
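A back-of-the-envelope version of the impact calculation might look like the following. The formula and the function `estimate_impact` are assumptions for illustration, and both default rates (`revenue_per_request`, `requests_per_user`) are made-up placeholders; the framework's own model is not documented here.

```python
def estimate_impact(throughput_rps, error_rate, duration_s,
                    revenue_per_request=0.05, requests_per_user=10):
    """Rough impact estimate; both default rates are made-up placeholders."""
    # Requests that failed during the incident window.
    failed_requests = throughput_rps * error_rate * duration_s
    return {
        "failed_requests": failed_requests,
        "revenue_lost": failed_requests * revenue_per_request,
        "users_affected": failed_requests / requests_per_user,
    }


# One minute at 1000 requests/s with a 25% error rate.
impact = estimate_impact(throughput_rps=1000, error_rate=0.25, duration_s=60)
```

With these placeholder rates, the example works out to 15,000 failed requests, roughly 750 in lost revenue (in whatever currency unit `revenue_per_request` uses), and about 1,500 affected users.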

🎮 Try These Scenarios

Test 1: Resource Exhaustion

Set CPU to 0.95 and Memory to 0.95, then watch scale_out actions trigger

Test 2: High Latency + Errors

Set Latency to 500ms and Error Rate to 0.15 to see the circuit breaker activate

Test 3: Gradual Degradation

Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds respond
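For intuition about what an adaptive threshold does in a scenario like Test 3, here is one common approach: compare each new value against a rolling mean plus a multiple of the rolling standard deviation. This is a generic technique, not necessarily the one this framework uses, and the `AdaptiveThreshold` class, window size, and multiplier `k` are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Flags a value that exceeds rolling mean + k * rolling standard deviation."""

    def __init__(self, window=20, k=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # most recent observations only
        self.k = k
        self.min_samples = min_samples

    def update(self, value):
        """Return True if `value` is anomalous versus recent history, then record it."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            limit = mean(self.history) + self.k * pstdev(self.history)
            anomalous = value > limit
        self.history.append(value)
        return anomalous


detector = AdaptiveThreshold()
# Latency readings (ms) hovering around 100 are treated as normal.
steady = [detector.update(v) for v in [100, 102, 98, 101, 99, 100, 103, 97]]
# A sudden jump well outside recent variation is flagged.
spike = detector.update(180)
```

One trade-off of this design: a very slow drift can pull the baseline up with it, so rolling thresholds are often paired with fixed limits like the defaults below.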

🚨 Default Alert Thresholds

Metric	Warning	Critical
Latency P99	>150ms	>300ms
Error Rate	>0.05	>0.15
CPU Utilization	>0.8	>0.9
Memory Utilization	>0.8	>0.9
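The table above translates directly into a lookup. The sketch below copies the documented limits into a dictionary and labels each metric; the names `THRESHOLDS`, `classify`, and `classify_all` are hypothetical, and the mapping from per-metric levels to an overall severity is deliberately left out because this README does not specify it.

```python
# (warning, critical) limits copied from the table above.
THRESHOLDS = {
    "latency_p99_ms": (150, 300),
    "error_rate": (0.05, 0.15),
    "cpu": (0.8, 0.9),
    "memory": (0.8, 0.9),
}


def classify(metric, value):
    """Label one metric value as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_all(reading):
    """Label every metric in a reading dict keyed like THRESHOLDS."""
    return {metric: classify(metric, value) for metric, value in reading.items()}


levels = classify_all({"latency_p99_ms": 350, "error_rate": 0.08, "cpu": 0.75, "memory": 0.65})
# levels == {'latency_p99_ms': 'critical', 'error_rate': 'warning', 'cpu': 'ok', 'memory': 'ok'}
```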

🔮 Roadmap

Predictive anomaly detection
Multi-cloud coordination
Advanced root cause analysis
Automated runbook execution
Team learning and knowledge transfer

💡 Why This Matters

"The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from reactive monitoring to proactive, autonomous reliability engineering.

🛠️ Technical Stack

Backend: Python, FastAPI, Sentence Transformers
AI/ML: FAISS, Hugging Face, Custom Agents
Frontend: Gradio
Storage: FAISS vector database, JSON metadata

Built with ❤️ by Juan Petter

AI Infrastructure Engineer | Building Self-Healing Agentic Systems