title: Agentic Reliability Framework MVP
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
license: mit

🧠 Agentic Reliability Framework MVP

Adaptive anomaly detection + AI-driven self-healing + persistent FAISS memory.

This project explores agentic reliability systems, blending observability, vector-based persistence, and AI inference to create self-healing cloud operations.

Built with:

  • ⚡ Gradio 5.49.1 for live visualization & dashboard UI
  • 🧩 FastAPI for REST endpoints (/add-event) with API key support
  • 🧠 Sentence Transformers (all-MiniLM-L6-v2) for embedding-based anomaly memory
  • 🔍 FAISS for similarity search across past incidents
  • 🔒 FileLock for safe concurrent saves in multi-user environments
  • 🤖 Hugging Face Router Inference API for adaptive reliability insights
  • ☁️ Python 3.10 runtime

🚀 Features

| Capability | Description |
|---|---|
| Adaptive Anomaly Detection | Detects anomalies dynamically based on latency and error-rate thresholds |
| AI Root Cause Analysis | Uses the Hugging Face Inference API for contextual one-line incident summaries |
| Self-Healing Actions | Simulates healing actions (scale-up, restart, etc.) |
| Persistent Memory (FAISS) | Learns from prior incidents, clusters patterns, and retrieves similar cases |
| Secure REST API | /add-event endpoint secured by X-API-Key header |
| Interactive Gradio UI | Visualize, test, and analyze events live in your browser |
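
As a rough illustration of the Adaptive Anomaly Detection and Self-Healing Actions rows, the check could look something like the sketch below. The threshold values, the `Event` class, the "Normal" status, and the "Scaled up instances" action are assumptions made for this example and are not taken from the project's code; only "Anomaly" and "Restarted container" appear in this README.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
LATENCY_THRESHOLD_MS = 150.0
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class Event:
    component: str
    latency: float      # milliseconds
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def classify(event: Event) -> str:
    """Return "Anomaly" if either metric exceeds its threshold, else "Normal"."""
    if event.latency > LATENCY_THRESHOLD_MS or event.error_rate > ERROR_RATE_THRESHOLD:
        return "Anomaly"
    return "Normal"


def choose_healing_action(event: Event) -> str:
    """Pick a simulated healing action; nothing is actually restarted or scaled."""
    if classify(event) != "Anomaly":
        return "No action needed"
    # Assumed policy: a high error rate suggests a restart, slow responses a scale-up.
    if event.error_rate > ERROR_RATE_THRESHOLD:
        return "Restarted container"
    return "Scaled up instances"
```

A truly adaptive detector would likely derive its thresholds from recent traffic rather than constants; fixed values are used here only to keep the sketch short.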

🧠 Example Output

✅ Event Processed (Anomaly)

Component: api-service
Latency: 224 ms
Error Rate: 0.062
Status: Anomaly
Analysis: Error 404: Not Found
Healing Action: Restarted container (Found 3 similar incidents)


🧩 Architecture Overview

┌──────────────────────┐
│ Gradio Frontend UI   │
└─────────┬────────────┘
          │ (submit telemetry)
          ▼
┌──────────────────────┐
│ FastAPI /add-event   │
│ + API Key validation │
└─────────┬────────────┘
          │ (call)
          ▼
┌───────────────────────────────┐
│ Hugging Face Inference API    │
│ → Reliability insight text    │
└─────────┬─────────────────────┘
          │
          ▼
┌───────────────────────────────┐
│ FAISS + Sentence Transformers │
│ → Embedding + similarity map  │
└───────────────────────────────┘
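
The final stage stores an embedding for each incident and retrieves the most similar past cases. The sketch below shows that idea with a brute-force cosine-similarity search in plain Python. It is a stand-in for a vector index, not FAISS itself, and the `IncidentMemory` class and the toy two-dimensional vectors are hypothetical; in the real app the vectors would come from the all-MiniLM-L6-v2 model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class IncidentMemory:
    """Brute-force stand-in for a vector index (hypothetical, not the FAISS API)."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: Sequence[float], description: str) -> None:
        """Store one incident embedding together with its description."""
        self._items.append((list(vector), description))

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[float, str]]:
        """Return up to k (score, description) pairs, most similar first."""
        scored = [(cosine_similarity(query, vec), desc) for vec, desc in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A linear scan like this is fine for a handful of incidents; a dedicated index such as FAISS becomes worthwhile as the incident history grows.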


🧾 API Usage

Endpoint:
POST /add-event

Headers:
X-API-Key: <your_api_key>

Body:

{
  "component": "api-service",
  "latency": 200,
  "error_rate": 0.04
}

Response:

{
  "status": "ok",
  "event": {
    "timestamp": "2025-11-08 23:29:03",
    "component": "api-service",
    "status": "Anomaly",
    "analysis": "Error 404: Not Found",
    "healing_action": "Restarted container Found 3 similar incidents ..."
  }
}
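
For reference, a minimal client for this endpoint might look like the sketch below, using only the Python standard library. The function names and the `base_url` parameter are assumptions for illustration; this README does not confirm which host or port serves the REST API, so pass whatever address your deployment exposes.

```python
import json
import urllib.request


def build_add_event_request(base_url: str, api_key: str, component: str,
                            latency: float, error_rate: float) -> urllib.request.Request:
    """Build a POST /add-event request carrying the X-API-Key header."""
    payload = {"component": component, "latency": latency, "error_rate": error_rate}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/add-event",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send_event(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running server, `send_event(build_add_event_request(base_url, "<your_api_key>", "api-service", 200, 0.04))` should return a dictionary shaped like the response shown above.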

Run locally:

git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework
pip install -r requirements.txt
python app.py

Then open http://localhost:7860

🌍 Live Space & Collaboration

👉 Launch Live Demo on Hugging Face

👉 Contribute or Fork on GitHub

🧭 Author

Juan D. Petter
AI Engineer & Cloud Architect
Building Agentic Systems for Scalable Automation | ex-NetApp
🔗 LinkedIn • GitHub

🪪 License

MIT License © 2025 Juan D. Petter