---
title: "Agentic Reliability Framework MVP"
emoji: "🧠"
colorFrom: "indigo"
colorTo: "blue"
sdk: "gradio"
sdk_version: "5.49.1"
app_file: "app.py"
pinned: true
python_version: "3.10"
license: "mit"
---

# 🧠 Agentic Reliability Framework MVP

**Adaptive anomaly detection + AI-driven self-healing + persistent FAISS memory.**

This project explores **agentic reliability systems**, blending observability, vector-based persistence, and AI inference to create self-healing cloud operations.

Built with:
- ⚡ **Gradio 5.49.1** for live visualization & dashboard UI  
- 🧩 **FastAPI** for REST endpoints (`/add-event`) with API key support  
- 🧠 **Sentence Transformers** (`all-MiniLM-L6-v2`) for embedding-based anomaly memory  
- 🔍 **FAISS** for similarity search across past incidents  
- 🔒 **FileLock** for safe concurrent saves in multi-user environments  
- 🤖 **Hugging Face Router Inference API** for adaptive reliability insights  
- ☁️ **Python 3.10** runtime

---

## 🚀 Features

| Capability | Description |
|-------------|--------------|
| **Adaptive Anomaly Detection** | Detects anomalies dynamically based on latency and error-rate thresholds |
| **AI Root Cause Analysis** | Uses the Hugging Face Inference API for contextual one-line incident summaries |
| **Self-Healing Actions** | Simulates healing actions (scale-up, restart, etc.) |
| **Persistent Memory (FAISS)** | Learns from prior incidents, clusters patterns, and retrieves similar cases |
| **Secure REST API** | `/add-event` endpoint secured by `X-API-Key` header |
| **Interactive Gradio UI** | Visualize, test, and analyze events live in your browser |
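
As a rough illustration of the detection and healing rows, the sketch below flags an event when latency or error rate crosses a fixed threshold and then picks a simulated action. The threshold values, the `Event` class, the `"Normal"` label, and the action strings are assumptions made for this example; the project's own logic may adapt its thresholds over time and is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the README does not state the real values.
LATENCY_THRESHOLD_MS = 150.0
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class Event:
    component: str
    latency: float      # milliseconds
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def classify(event: Event) -> str:
    """Return "Anomaly" if either metric exceeds its threshold, else "Normal"."""
    if event.latency > LATENCY_THRESHOLD_MS or event.error_rate > ERROR_RATE_THRESHOLD:
        return "Anomaly"
    return "Normal"


def simulate_healing(event: Event) -> str:
    """Pick a simulated action; nothing is actually restarted or scaled."""
    if event.error_rate > ERROR_RATE_THRESHOLD:
        return "Restarted container"
    if event.latency > LATENCY_THRESHOLD_MS:
        return "Scaled up"
    return "No action"
```

Checking the error rate first is an arbitrary design choice in this sketch: it treats failing requests as more urgent than slow ones.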

---

## 🧠 Example Output

✅ **Event Processed (Anomaly)**

Component: api-service  
Latency: 224 ms  
Error Rate: 0.062  
Status: Anomaly  
Analysis: Error 404: Not Found  
Healing Action: Restarted container (Found 3 similar incidents)
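
The "Found 3 similar incidents" suffix presumably comes from a similarity lookup over stored incident embeddings. The sketch below shows the idea with a brute-force cosine search in plain Python; it is a stand-in for FAISS, not the project's code. The `IncidentMemory` class, the `min_score` cutoff, and the toy 2-dimensional vectors are assumptions (real `all-MiniLM-L6-v2` embeddings have 384 dimensions).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IncidentMemory:
    """Hypothetical in-memory store; FAISS would replace the linear scan."""

    def __init__(self):
        self._items = []  # list of (vector, description) pairs

    def add(self, vector, description):
        self._items.append((list(vector), description))

    def similar(self, query, k=3, min_score=0.8):
        """Return up to k (score, description) pairs, best match first."""
        scored = [(cosine(query, vec), text) for vec, text in self._items]
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


memory = IncidentMemory()
memory.add([1.0, 0.0], "api-service latency spike")
memory.add([0.9, 0.1], "api-service slow responses")
memory.add([0.0, 1.0], "db-service disk full")

matches = memory.similar([1.0, 0.0])
summary = f"Found {len(matches)} similar incidents"
```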


---

## 🧩 Architecture Overview

┌──────────────────────┐
│  Gradio Frontend UI  │
└─────────┬────────────┘
│ (submit telemetry)
▼
┌──────────────────────┐
│  FastAPI /add-event  │
│ + API Key validation │
└─────────┬────────────┘
│ (call)
▼
┌──────────────────────────────┐
│  Hugging Face Inference API  │
│  → Reliability insight text  │
└─────────┬────────────────────┘
│
▼
┌──────────────────────────────┐
│ FAISS + Sentence Transformers│
│ → Embedding + similarity map │
└──────────────────────────────┘
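
The "API Key validation" step in the diagram can be pictured as a header check that runs before any event processing. The following is a minimal, framework-independent sketch; the `is_authorized` helper and the way the expected key is supplied are assumptions, and the actual FastAPI handler may be written differently.

```python
import hmac


def is_authorized(headers: dict, expected_key: str) -> bool:
    """Return True only if the X-API-Key header matches the expected key."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-api-key", "")
    if not expected_key:
        # Refuse everything when no key is configured (a choice made for this sketch).
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```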

---

## 🧾 API Usage

**Endpoint:**  
`POST /add-event`

**Headers:**  
`X-API-Key: <your_api_key>`

**Body:**
```json
{
  "component": "api-service",
  "latency": 200,
  "error_rate": 0.04
}
```

**Response:**
```json
{
  "status": "ok",
  "event": {
    "timestamp": "2025-11-08 23:29:03",
    "component": "api-service",
    "status": "Anomaly",
    "analysis": "Error 404: Not Found",
    "healing_action": "Restarted container Found 3 similar incidents ..."
  }
}
```

**Run locally:**
```bash
git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework
pip install -r requirements.txt
python app.py
```

Then open http://localhost:7860
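
With the app running, the `/add-event` endpoint could be called from Python roughly as follows. This is a sketch using only the standard library; it assumes the REST endpoint is served from the same base URL as the UI (`http://localhost:7860`), which the README does not confirm, and the helper names `build_add_event_request` and `send` are made up for this example.

```python
import json
import urllib.request


def build_add_event_request(base_url, api_key, component, latency, error_rate):
    """Build (but do not send) a POST request for the /add-event endpoint."""
    payload = json.dumps(
        {"component": component, "latency": latency, "error_rate": error_rate}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/add-event",
        data=payload,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a running server and a valid key):
# request = build_add_event_request(
#     "http://localhost:7860", "<your_api_key>", "api-service", 200, 0.04
# )
# print(send(request))
```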

## 🌍 Live Space & Collaboration

👉 Launch Live Demo on Hugging Face

👉 Contribute or Fork on GitHub

## 🧭 Author

Juan D. Petter  
AI Engineer & Cloud Architect  
Building Agentic Systems for Scalable Automation | ex-NetApp  
🔗 LinkedIn • GitHub

## 🪪 License

MIT License © 2025 Juan D. Petter