metadata
title: Agentic Reliability Framework
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection

🧠 Agentic Reliability Framework

AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing

🚀 Live Demo

Try it now! Enter system telemetry data and watch specialized AI agents analyze it, diagnose issues, and recommend healing actions in real time.

🎯 What It Does

This framework transforms traditional monitoring into autonomous reliability engineering:

  • 🤖 Multi-Agent AI Analysis: Specialized agents work together to detect and diagnose issues
  • 🔧 Automated Healing: Policy-based auto-remediation for common failures
  • 💰 Business Impact: Real-time revenue and user impact calculations
  • 📚 Learning System: FAISS-powered memory learns from every incident
  • ⚡ Production Ready: Circuit breakers, adaptive thresholds, enterprise features
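The circuit breakers mentioned above follow a well-known pattern: stop sending traffic to a failing dependency, then probe it again after a cool-down. The sketch below is a minimal illustration of that pattern, not the framework's own implementation; the class name, the `failure_threshold` and `reset_timeout_s` parameters, and the injectable `clock` are all assumptions made for this example.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (closed -> open -> half_open)."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout_s = reset_timeout_s      # assumed default
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half_open"  # cool-down elapsed: allow a trial request
        return "open"

    def allow_request(self):
        return self.state != "open"

    def record_success(self):
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Passing a fake `clock` makes the open and half-open transitions testable without real waiting.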

🛠️ Quick Start

1. Select a Service

Choose from: api-service, auth-service, payment-service, database, cache-service

2. Adjust Metrics

  • Latency P99: Alert threshold >150ms (adaptive)
  • Error Rate: Alert threshold >0.05 (5%)
  • Throughput: Current requests per second
  • CPU/Memory: Utilization (0.0-1.0 scale)
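The inputs listed above can be pictured as one telemetry record. The following is a sketch of such a record with basic range checks; the class name, field names, and validation rules are assumptions for illustration and are not taken from the framework's `app.py`. The throughput value in the usage line is also made up, since the example test cases do not specify one.

```python
from dataclasses import dataclass

# Services offered in the demo's drop-down.
SERVICES = {"api-service", "auth-service", "payment-service", "database", "cache-service"}


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical container for one submitted telemetry event."""

    component: str
    latency_p99_ms: float   # milliseconds
    error_rate: float       # fraction, 0.0-1.0
    throughput_rps: float   # requests per second
    cpu_util: float         # fraction, 0.0-1.0
    memory_util: float      # fraction, 0.0-1.0

    def __post_init__(self):
        if self.component not in SERVICES:
            raise ValueError(f"unknown service: {self.component!r}")
        for name in ("error_rate", "cpu_util", "memory_util"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
        if self.latency_p99_ms < 0 or self.throughput_rps < 0:
            raise ValueError("latency and throughput must be non-negative")


# The "Critical Failure" test case, with an assumed throughput of 1000 req/s.
event = TelemetryEvent("api-service", 800.0, 0.25, 1000.0, 0.95, 0.90)
```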

3. Submit & Analyze

Click "Submit Telemetry Event" to see AI agents in action!

📊 Example Test Cases

🚨 Critical Failure

  • Component: api-service
  • Latency: 800ms
  • Error Rate: 0.25
  • CPU: 0.95
  • Memory: 0.90

Expected: CRITICAL severity, circuit_breaker + scale_out actions

⚠️ Performance Issue

  • Component: auth-service
  • Latency: 350ms
  • Error Rate: 0.08
  • CPU: 0.75
  • Memory: 0.65

Expected: HIGH severity, traffic_shift action

✅ Normal Operation

  • Component: payment-service
  • Latency: 120ms
  • Error Rate: 0.02
  • CPU: 0.45
  • Memory: 0.35

Expected: NORMAL status, no actions needed

🔧 Technical Features

Multi-Agent Architecture

  • 🕵️ Detective Agent: Anomaly detection & pattern recognition
  • 🔍 Diagnostician Agent: Root cause analysis & investigation
  • 🤖 Orchestration Manager: Coordinates all agents in parallel
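One way to picture the parallel coordination described above is with `asyncio.gather`, which runs several coroutines concurrently and returns their results in call order. This is a sketch under assumptions: the function names, the dictionary-based event, and the simple rules inside each agent are hypothetical stand-ins, not the framework's actual agents.

```python
import asyncio


async def detective_agent(event: dict) -> dict:
    """Hypothetical detector: flags latency above the 150ms warning threshold."""
    await asyncio.sleep(0)  # stand-in for real asynchronous work
    return {"agent": "detective", "anomaly": event["latency_p99_ms"] > 150}


async def diagnostician_agent(event: dict) -> dict:
    """Hypothetical diagnostician: lists suspected causes from simple rules."""
    await asyncio.sleep(0)
    causes = []
    if event["cpu_util"] > 0.8:
        causes.append("cpu_saturation")
    if event["error_rate"] > 0.05:
        causes.append("elevated_errors")
    return {"agent": "diagnostician", "suspected_causes": causes}


async def orchestrate(event: dict) -> dict:
    """Run both agents concurrently and index their reports by agent name."""
    results = await asyncio.gather(detective_agent(event), diagnostician_agent(event))
    return {result["agent"]: result for result in results}


report = asyncio.run(
    orchestrate({"latency_p99_ms": 800, "error_rate": 0.25, "cpu_util": 0.95})
)
```

With real agents that call models or external services, running them concurrently keeps total analysis time close to that of the slowest agent rather than the sum of all of them.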

Smart Detection

  • Adaptive thresholds that learn from your environment
  • Multi-dimensional anomaly scoring (0-100% confidence)
  • Correlation analysis across metrics
  • FAISS vector memory for incident similarity
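To illustrate what an adaptive threshold can look like, the sketch below derives the alert level from recent healthy samples (rolling mean plus `k` standard deviations) and falls back to a static default until enough data has been seen. This is one common approach chosen for illustration; the README does not say which method the framework uses, and the class name, window size, and `k` are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Hypothetical rolling threshold: mean + k * stddev of recent normal samples."""

    def __init__(self, initial, window=50, k=3.0, min_samples=10):
        self.initial = initial          # static fallback, e.g. 150 (ms)
        self.k = k
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)

    @property
    def value(self):
        # Use the static default until enough history has accumulated.
        if len(self.samples) < self.min_samples:
            return self.initial
        return mean(self.samples) + self.k * pstdev(self.samples)

    def observe(self, x):
        """Return True if x breaches the current threshold."""
        breached = x > self.value
        if not breached:
            # Only learn from non-anomalous samples so that outliers
            # do not inflate the baseline.
            self.samples.append(x)
        return breached
```

A service that normally runs near 100ms would, after the warm-up period, alert well below the static 150ms default, which is the behavior the "Gradual Degradation" scenario is meant to surface.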

Business Intelligence

  • Real-time revenue impact calculations
  • User impact estimation
  • Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
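A rough way to reason about business impact is to count the requests that fail during an incident and attach a value to each. The sketch below does only that; the formula, the function name, and the `revenue_per_request` and `requests_per_user` parameters are illustrative assumptions, not the framework's actual calculation.

```python
def estimate_impact(throughput_rps, error_rate, duration_s,
                    revenue_per_request, requests_per_user=1.0):
    """Hypothetical impact estimate from failed requests over an incident window."""
    failed_requests = throughput_rps * error_rate * duration_s
    return {
        "failed_requests": failed_requests,
        "revenue_at_risk": failed_requests * revenue_per_request,
        "users_affected": failed_requests / requests_per_user,
    }


# Assumed inputs: 1000 req/s, 25% errors, a 60-second incident, $0.10 per request.
impact = estimate_impact(1000, 0.25, 60, 0.10)
```

Any real estimate depends on how revenue is attributed to requests for a given service, so these inputs would need to come from the business rather than from telemetry alone.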

🎮 Try These Scenarios

Test 1: Resource Exhaustion

Set CPU to 0.95 and Memory to 0.95 - watch scale_out actions trigger

Test 2: High Latency + Errors

Set Latency to 500ms and Error Rate to 0.15 - see circuit breaker activation

Test 3: Gradual Degradation

Start with normal values and slowly increase latency and error rate to see how the adaptive thresholds respond

🚨 Default Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| Latency P99 | >150ms | >300ms |
| Error Rate | >0.05 | >0.15 |
| CPU Utilization | >0.8 | >0.9 |
| Memory Utilization | >0.8 | >0.9 |
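The default thresholds above translate directly into a per-metric check. The sketch below encodes the table and labels each metric as ok, warning, or critical; the metric key names and function names are assumptions, and how the framework combines per-metric levels into an overall LOW/MEDIUM/HIGH/CRITICAL severity is not shown here.

```python
# (warning, critical) pairs copied from the default threshold table.
THRESHOLDS = {
    "latency_p99_ms": (150.0, 300.0),
    "error_rate": (0.05, 0.15),
    "cpu_util": (0.8, 0.9),
    "memory_util": (0.8, 0.9),
}


def check_metric(name, value):
    """Label one metric value; thresholds are strict (>), as in the table."""
    warning, critical = THRESHOLDS[name]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def check_all(metrics):
    """Label every metric in a {name: value} mapping."""
    return {name: check_metric(name, value) for name, value in metrics.items()}


# The "Critical Failure" example test case.
levels = check_all(
    {"latency_p99_ms": 800, "error_rate": 0.25, "cpu_util": 0.95, "memory_util": 0.90}
)
```

Because the comparisons are strict, a memory utilization of exactly 0.90 is labeled warning rather than critical.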

🔮 Roadmap

  • Predictive anomaly detection
  • Multi-cloud coordination
  • Advanced root cause analysis
  • Automated runbook execution
  • Team learning and knowledge transfer

💡 Why This Matters

"The most reliable system is the one that fixes itself before anyone notices there was a problem."

This framework represents the evolution from reactive monitoring to proactive, autonomous reliability engineering.

🛠️ Technical Stack

  • Backend: Python, FastAPI, Sentence Transformers
  • AI/ML: FAISS, Hugging Face, Custom Agents
  • Frontend: Gradio
  • Storage: FAISS vector database, JSON metadata

Built with ❤️ by Juan Petter

AI Infrastructure Engineer | Building Self-Healing Agentic Systems