	---
	title: Agentic Reliability Framework
	emoji: 🧠
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: "4.44.1"
	app_file: app.py
	pinned: false
	license: mit
	short_description: AI-powered reliability with multi-agent anomaly detection
	---

	# 🧠 Agentic Reliability Framework

	AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing

	## 🚀 Live Demo

	Try it now! Enter system telemetry data and watch specialized AI agents analyze it, diagnose issues, and recommend healing actions in real time.

	## 🎯 What It Does

	This framework transforms traditional monitoring into autonomous reliability engineering:

	- 🤖 Multi-Agent AI Analysis: Specialized agents work together to detect and diagnose issues
	- 🔧 Automated Healing: Policy-based auto-remediation for common failures
	- 💰 Business Impact: Real-time revenue and user impact calculations
	- 📚 Learning System: FAISS-powered memory learns from every incident
	- ⚡ Production Ready: Circuit breakers, adaptive thresholds, enterprise features

	## 🛠️ Quick Start

	### 1. Select a Service
	Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`

	### 2. Adjust Metrics
	- Latency P99: Alert threshold >150ms (adaptive)
	- Error Rate: Alert threshold >0.05 (5%)
	- Throughput: Current requests per second
	- CPU/Memory: Utilization (0.0-1.0 scale)
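
As a rough illustration of the inputs above, the sketch below models one telemetry submission as a validated record. The class name `TelemetryEvent` and its field names are assumptions made for this example, not the app's actual schema, and the throughput value in the usage line is arbitrary.

```python
from dataclasses import dataclass

# Service names listed in the Quick Start.
KNOWN_SERVICES = {
    "api-service", "auth-service", "payment-service", "database", "cache-service",
}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one submission (field names are assumptions)."""
    component: str
    latency_p99_ms: float
    error_rate: float      # fraction, 0.0-1.0
    throughput_rps: float  # requests per second
    cpu_util: float        # fraction, 0.0-1.0
    memory_util: float     # fraction, 0.0-1.0

    def __post_init__(self) -> None:
        if self.component not in KNOWN_SERVICES:
            raise ValueError(f"unknown component: {self.component!r}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")
        # Ratios use the 0.0-1.0 scale described above.
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


# Throughput of 1000 rps is an arbitrary illustrative value.
event = TelemetryEvent("api-service", 800.0, 0.25, 1000.0, 0.95, 0.90)
```

Validating at construction time keeps out-of-range values (for example an error rate entered as `25` instead of `0.25`) from reaching any downstream analysis.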

	### 3. Submit & Analyze
	Click "Submit Telemetry Event" to see AI agents in action!

	## 📊 Example Test Cases

	### 🚨 Critical Failure
	- Component: `api-service`
	- Latency: 800ms
	- Error Rate: 0.25
	- CPU: 0.95
	- Memory: 0.90

	Expected: CRITICAL severity, `circuit_breaker` + `scale_out` actions

	### ⚠️ Performance Issue
	- Component: `auth-service`
	- Latency: 350ms
	- Error Rate: 0.08
	- CPU: 0.75
	- Memory: 0.65

	Expected: HIGH severity, `traffic_shift` action

	### ✅ Normal Operation
	- Component: `payment-service`
	- Latency: 120ms
	- Error Rate: 0.02
	- CPU: 0.45
	- Memory: 0.35

	Expected: NORMAL status, no actions needed
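
To make the idea of policy-based remediation concrete, here is a minimal sketch of a rule table that maps metrics to actions. The rules and cut-offs are hypothetical: they were chosen only so that the three test cases above produce the expected actions, and the framework's real policies may differ.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]

# Hypothetical policy table: (action name, condition).
# The cut-offs are assumptions picked to reproduce the expected results above.
POLICIES: List[Tuple[str, Callable[[Metrics], bool]]] = [
    ("circuit_breaker", lambda m: m["error_rate"] >= 0.15),
    ("scale_out", lambda m: m["cpu"] > 0.9 or m["memory"] > 0.9),
    ("traffic_shift", lambda m: m["latency_ms"] > 300 and m["error_rate"] < 0.15),
]


def recommend_actions(metrics: Metrics) -> List[str]:
    """Return the names of all policies whose condition matches, in table order."""
    return [name for name, condition in POLICIES if condition(metrics)]


critical = {"latency_ms": 800, "error_rate": 0.25, "cpu": 0.95, "memory": 0.90}
degraded = {"latency_ms": 350, "error_rate": 0.08, "cpu": 0.75, "memory": 0.65}
normal = {"latency_ms": 120, "error_rate": 0.02, "cpu": 0.45, "memory": 0.35}

print(recommend_actions(critical))  # ['circuit_breaker', 'scale_out']
print(recommend_actions(degraded))  # ['traffic_shift']
print(recommend_actions(normal))    # []
```

Keeping policies as data rather than nested `if` statements makes it straightforward to add, reorder, or disable rules per service.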

	## 🔧 Technical Features

	### Multi-Agent Architecture
	- 🕵️ Detective Agent: Anomaly detection & pattern recognition
	- 🔍 Diagnostician Agent: Root cause analysis & investigation
	- 🤖 Orchestration Manager: Coordinates all agents in parallel
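
One way to picture agents running in parallel is with `asyncio.gather`. The sketch below is an assumption about the general shape, not the framework's implementation: the agent functions, their return fields, and the cause labels are all hypothetical, and the checks reuse the default warning thresholds from this README.

```python
import asyncio
from typing import Any, Dict, List

Event = Dict[str, float]


async def detective_agent(event: Event) -> Dict[str, Any]:
    """Hypothetical anomaly check against the default warning thresholds."""
    await asyncio.sleep(0)  # stand-in for real asynchronous work
    anomalous = event["latency_ms"] > 150 or event["error_rate"] > 0.05
    return {"agent": "detective", "anomaly": anomalous}


async def diagnostician_agent(event: Event) -> Dict[str, Any]:
    """Hypothetical root-cause guess based on resource utilization."""
    await asyncio.sleep(0)
    causes: List[str] = []
    if event["cpu"] > 0.8:
        causes.append("cpu_saturation")
    if event["memory"] > 0.8:
        causes.append("memory_pressure")
    return {"agent": "diagnostician", "likely_causes": causes}


async def orchestrate(event: Event) -> Dict[str, Dict[str, Any]]:
    """Run all agents concurrently and index their results by agent name."""
    results = await asyncio.gather(
        detective_agent(event),
        diagnostician_agent(event),
    )
    return {result["agent"]: result for result in results}


report = asyncio.run(
    orchestrate({"latency_ms": 800, "error_rate": 0.25, "cpu": 0.95, "memory": 0.90})
)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the orchestrator can merge them deterministically even though the agents run concurrently.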

	### Smart Detection
	- Adaptive thresholds that learn from your environment
	- Multi-dimensional anomaly scoring (0-100% confidence)
	- Correlation analysis across metrics
	- FAISS vector memory for incident similarity
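
The README does not say how the adaptive thresholds are computed. One common approach, sketched here purely as an assumption, is to track an exponentially weighted mean and variance of a metric and flag values more than `k` standard deviations above the mean, falling back to a static default until enough samples have arrived. The class name and parameters are hypothetical.

```python
import math


class AdaptiveThreshold:
    """Hypothetical EWMA-based threshold; not the framework's actual algorithm."""

    def __init__(self, default_threshold: float, alpha: float = 0.1,
                 k: float = 3.0, warmup: int = 5) -> None:
        self.default_threshold = default_threshold
        self.alpha = alpha    # smoothing factor for the moving statistics
        self.k = k            # allowed deviation, in standard deviations
        self.warmup = warmup  # samples required before adapting
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    @property
    def threshold(self) -> float:
        if self.count < self.warmup:
            return self.default_threshold
        return self.mean + self.k * math.sqrt(self.var)

    def observe(self, value: float) -> bool:
        """Return True if value exceeds the current threshold, then update the baseline."""
        anomalous = value > self.threshold
        if self.count == 0:
            self.mean = value
        else:
            delta = value - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.count += 1
        return anomalous


latency = AdaptiveThreshold(default_threshold=150.0)
baseline_flags = [latency.observe(v) for v in (100, 102, 98, 101, 99, 100)]
spike_flag = latency.observe(400)
```

Note that this sketch also folds anomalous samples into the baseline; a production version would likely exclude or down-weight them so that a sustained incident does not raise the threshold.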

	### Business Intelligence
	- Real-time revenue impact calculations
	- User impact estimation
	- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
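
The formula behind the revenue and user impact figures is not documented here. A simple back-of-the-envelope model, shown below as an assumption, multiplies throughput by error rate to estimate failed requests and then applies an average revenue per request. The function name and all parameters are hypothetical.

```python
from typing import Dict


def estimate_impact(throughput_rps: float, error_rate: float,
                    revenue_per_request: float, duration_s: float) -> Dict[str, float]:
    """Hypothetical impact model: failed requests and the revenue they represent."""
    failed_requests = throughput_rps * error_rate * duration_s
    return {
        "failed_requests": failed_requests,
        "revenue_at_risk": failed_requests * revenue_per_request,
    }


# Illustrative inputs only: 1000 rps, 25% errors, 0.50 per request, one minute.
impact = estimate_impact(1000.0, 0.25, 0.50, 60.0)
```

This treats every failed request as lost revenue, which overstates the impact when clients retry successfully; a real model would need retry and conversion data.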

	## 🎮 Try These Scenarios

	### Test 1: Resource Exhaustion
	Set CPU to 0.95 and Memory to 0.95 - watch scale_out actions trigger

	### Test 2: High Latency + Errors
	Set Latency to 500ms and Error Rate to 0.15 - see circuit breaker activation
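
For readers unfamiliar with the pattern, the sketch below shows a minimal circuit breaker state machine: it opens after a number of consecutive failures, allows a trial request after a timeout (half-open), and closes again on success. This is a generic illustration of the pattern with hypothetical names and defaults, not the framework's own breaker.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Generic circuit breaker sketch (names and defaults are assumptions)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        # Closed and half-open circuits let traffic through; open ones reject it.
        return self.state != "open"

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        # A failed trial request, or too many consecutive failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


breaker = CircuitBreaker(failure_threshold=2)
breaker.record_failure()
breaker.record_failure()
print(breaker.state)  # open
```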

	### Test 3: Gradual Degradation
	Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds respond

	## 🚨 Default Alert Thresholds

	| Metric | Warning | Critical |
	|--------|---------|----------|
	| Latency P99 | >150ms | >300ms |
	| Error Rate | >0.05 | >0.15 |
	| CPU Utilization | >0.8 | >0.9 |
	| Memory Utilization | >0.8 | >0.9 |
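
The table above translates directly into a lookup. The sketch below is one possible encoding, assuming strict "greater than" comparisons as the `>` signs suggest; the dictionary keys and the `classify` function are hypothetical names.

```python
from typing import Dict, Tuple

# (warning, critical) pairs taken from the default threshold table.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "latency_p99_ms": (150.0, 300.0),
    "error_rate": (0.05, 0.15),
    "cpu": (0.8, 0.9),
    "memory": (0.8, 0.9),
}


def classify(metric: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a single metric value."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(classify("latency_p99_ms", 350))  # critical
print(classify("error_rate", 0.08))     # warning
print(classify("cpu", 0.45))            # ok
```

Because the comparisons are strict, a value exactly at a boundary (for example an error rate of `0.15`) falls into the lower band.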

	## 🔮 Roadmap

	- [ ] Predictive anomaly detection
	- [ ] Multi-cloud coordination
	- [ ] Advanced root cause analysis
	- [ ] Automated runbook execution
	- [ ] Team learning and knowledge transfer

	## 💡 Why This Matters

	> "The most reliable system is the one that fixes itself before anyone notices there was a problem."

	This framework represents the evolution from reactive monitoring to proactive, autonomous reliability engineering.

	## 🛠️ Technical Stack

	- Backend: Python, FastAPI, Sentence Transformers
	- AI/ML: FAISS, Hugging Face, Custom Agents
	- Frontend: Gradio
	- Storage: FAISS vector database, JSON metadata

	---

	Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)

	AI Infrastructure Engineer | Building Self-Healing Agentic Systems