<p align="center">
  <img src="https://dummyimage.com/1200x260/000/fff&text=AGENTIC+RELIABILITY+FRAMEWORK" width="100%" alt="Agentic Reliability Framework Banner" />
</p>

<h1 align="center">⚙️ Agentic Reliability Framework</h1>

<p align="center">
  <strong>Adaptive anomaly detection + policy-driven self-healing for AI systems</strong><br>
  Minimal, fast, and production-focused.
</p>

<p align="center">
  <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10+-blue" alt="Python 3.10+"></a>
  <a href="#"><img src="https://img.shields.io/badge/status-MVP-green" alt="Status: MVP"></a>
  <a href="#"><img src="https://img.shields.io/badge/license-MIT-lightgrey" alt="License: MIT"></a>
</p>

## 🧠 Agentic Reliability Framework

**Autonomous Reliability Engineering for Production AI Systems**

Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that automatically detects, diagnoses, predicts, and resolves incidents, with a median (p50) end-to-end latency of roughly 100 ms.

## ⭐ Key Features

- **Real-time anomaly detection** across latency, errors, throughput & resources
- **Root-cause analysis** with evidence correlation
- **Predictive forecasting** (15-minute lookahead)
- **Automated healing policies** (restart, rollback, scale, circuit break)
- **Incident memory** with FAISS for semantic recall
- **Security hardened** (all CVEs listed under Security Hardening are patched)
- **Thread-safe, async, process-pooled architecture**
- **~100 ms end-to-end latency** (p50)

## 🔐 Security Hardening (v2.0)

| CVE | Severity | Component | Status |
|-----|----------|-----------|--------|
| CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
| CVE-2025-48889 | 7.5 | Gradio SVG DOS | ✅ Patched |
| CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
| CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
| CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |

### Additional Hardening

- SHA-256 hashing everywhere (no MD5)
- Pydantic v2 input validation
- Rate limiting (60 req/min/user)
- Atomic operations with a thread-safe FAISS single-writer pattern
- Lock-free reads for high throughput
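
As an illustration of the per-user rate limit listed above, here is a minimal sliding-window sketch. It is not ARF's implementation: the class name `SlidingWindowRateLimiter`, its parameters, and the injectable `clock` are assumptions made for this example, with the defaults chosen to match the 60 requests per minute figure.

```python
import threading
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Hypothetical limiter: at most `limit` requests per window for each user."""

    def __init__(self, limit=60, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # user id -> timestamps of accepted requests
        self._lock = threading.Lock()

    def allow(self, user_id):
        """Return True and record the request if the user is under the limit."""
        now = self.clock()
        with self._lock:
            hits = self._hits[user_id]
            # Drop timestamps that have aged out of the window.
            while hits and now - hits[0] >= self.window_seconds:
                hits.popleft()
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True


limiter = SlidingWindowRateLimiter()
if not limiter.allow("user-123"):
    print("rate limit exceeded")
```

A sliding window avoids the burst at window boundaries that a fixed per-minute counter allows, at the cost of storing one timestamp per accepted request.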

## ⚡ Lock-Free Reads for High Throughput

By restructuring the internal memory stores around lock-free, single-writer / multi-reader semantics, the framework delivers deterministic concurrency without blocking. This removes tail-latency spikes and keeps event flows smooth even under burst load.
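
The single-writer / multi-reader idea can be sketched as follows. This is an assumption-laden illustration rather than the framework's code: `SnapshotStore` and its methods are hypothetical names, and the sketch relies on CPython rebinding an attribute atomically, so readers always see either the old or the new immutable snapshot.

```python
import queue
import threading


class SnapshotStore:
    """Hypothetical store: one writer thread, any number of lock-free readers."""

    def __init__(self):
        self._snapshot = ()            # immutable tuple, replaced wholesale on write
        self._pending = queue.Queue()  # all writes are funneled through this queue
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def submit(self, item):
        """Enqueue a write; only the writer thread ever touches the snapshot."""
        self._pending.put(item)

    def read(self):
        """Return the current snapshot without taking a lock."""
        return self._snapshot

    def _drain(self):
        while True:
            item = self._pending.get()
            if item is None:  # sentinel posted by close()
                self._pending.task_done()
                return
            # Build a new tuple, then publish it with a single rebind.
            self._snapshot = self._snapshot + (item,)
            self._pending.task_done()

    def flush(self):
        """Block until every submitted write has been applied."""
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._writer.join()


store = SnapshotStore()
store.submit({"incident": "db_pool_exhausted"})
store.flush()
print(len(store.read()))
store.close()
```

Copying the tuple on every write costs O(n), so this pattern suits read-heavy workloads; a real vector index would batch writes before publishing a new snapshot.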

### Performance Impact

| Metric | Before | After | Δ |
|--------|--------|-------|---|
| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
| Agent Orchestration | Sequential | Parallel | 3× throughput |
| Memory Behavior | Growing | Stable / Bounded | 0 leaks |

## 🧩 Architecture Overview

### System Flow

```
Your Production System
(APIs, Databases, Microservices)
           ↓
  Agentic Reliability Core
  Detect → Diagnose → Predict
           ↓
        Agents:
  🕵️ Detective Agent – Anomaly detection
  🔍 Diagnostician Agent – Root cause analysis
  🔮 Predictive Agent – Forecasting / risk estimation
           ↓
    Policy Engine (Auto-Healing)
           ↓
    Healing Actions:
    • Restart
    • Scale
    • Rollback
    • Circuit-break
```

## 🏗️ Core Framework Components

### Web Framework & UI

- **Gradio 5.50+** - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860)
- **Python 3.10+** - Core implementation with asynchronous, thread-safe architecture

### AI/ML Stack

- **FAISS-CPU 1.13.0** - Facebook AI Similarity Search for persistent incident memory and vector operations
- **SentenceTransformers 5.1.1** - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
- **NumPy 1.26.4** - Numerical computing foundation for vector operations and data processing

### Data & HTTP Layer

- **Pydantic 2.11+** - Type-safe data modeling with frozen models for immutability and runtime validation
- **Requests 2.32.5** - HTTP client library for external API communication (security patched)

### Reliability & Resilience

- **CircuitBreaker 2.0+** - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
- **AtomicWrites 1.4.1** - Atomic file operations ensuring data consistency and durability
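
To make the circuit breaker pattern concrete, here is a small self-contained state machine. It does not use or mirror the API of the CircuitBreaker package listed above; `SimpleCircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical and exist only for this sketch.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class SimpleCircuitBreaker:
    """Hypothetical breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps a struggling dependency from being hammered with retries, which is what limits cascading failures.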

## 🎯 Architecture Pattern

ARF implements a **Multi-Agent Orchestration Pattern** with three specialized agents:

- **Detective Agent** - Anomaly detection
- **Diagnostician Agent** - Root cause analysis
- **Predictive Agent** - Future risk forecasting

All agents run in **parallel** (not sequential) for **3× throughput improvement**.
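
Running the three agents concurrently can be sketched with `asyncio.gather`. The agent coroutines below are stand-ins invented for this example (their names, inputs, and return shapes are assumptions); only the orchestration pattern is the point.

```python
import asyncio


async def detective(event):
    await asyncio.sleep(0.01)  # stand-in for real detection work
    return {"agent": "detective", "anomaly": event["latency_p99"] > 200}


async def diagnostician(event):
    await asyncio.sleep(0.01)  # stand-in for root-cause analysis
    cause = "dependency_timeout" if event["error_rate"] > 0.01 else "unknown"
    return {"agent": "diagnostician", "cause": cause}


async def predictive(event):
    await asyncio.sleep(0.01)  # stand-in for forecasting
    return {"agent": "predictive", "risk": "high" if event["latency_p99"] > 300 else "low"}


async def analyze(event):
    """Run all three agents concurrently and index the results by agent name."""
    results = await asyncio.gather(
        detective(event), diagnostician(event), predictive(event)
    )
    return {r["agent"]: r for r in results}


report = asyncio.run(analyze({"latency_p99": 350, "error_rate": 0.02}))
print(sorted(report))
```

With `gather`, total latency approaches that of the slowest agent instead of the sum of all three, provided the agents spend their time awaiting I/O or offloaded work rather than blocking the event loop.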

### ⚡ Performance Features

- Native async handlers (no event loop overhead)
- Thread-safe single-writer/multi-reader pattern for FAISS
- RLock-protected policy evaluation
- Queue-based writes to prevent race conditions
- ~100 ms p50 latency at 100+ events/second

The framework combines **Gradio** for the web/UI layer, **FAISS** for vector memory, and **SentenceTransformers** for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.

## 🧪 The Three Agents

### 🕵️ Detective Agent (Anomaly Detection)

Real-time vector embeddings + adaptive thresholds to surface deviations before they cascade.

- Adaptive multi-metric scoring
- CPU/mem resource anomaly detection
- Latency & error spike detection
- Confidence scoring (0–1)
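
One way to picture adaptive thresholds with a 0–1 confidence score is a rolling z-score. The sketch below covers only that part (not the vector embeddings), and `AdaptiveThreshold`, its defaults, and the confidence mapping are assumptions rather than ARF's actual scoring.

```python
import math
from collections import deque
from statistics import fmean, pstdev


class AdaptiveThreshold:
    """Hypothetical detector: flags values far from a rolling baseline."""

    def __init__(self, window=50, z_limit=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # recent observations form the baseline
        self.z_limit = z_limit
        self.min_samples = min_samples

    def score(self, value):
        """Return (is_anomaly, confidence in [0, 1]), then learn from `value`."""
        if len(self.values) < self.min_samples:
            self.values.append(value)
            return False, 0.0  # not enough history to judge yet
        mean = fmean(self.values)
        std = pstdev(self.values)
        if std == 0:
            z = 0.0 if value == mean else math.inf
        else:
            z = abs(value - mean) / std
        self.values.append(value)
        # Map the z-score to [0, 1]: 0.5 at the limit, 1.0 at twice the limit.
        confidence = min(z / (2 * self.z_limit), 1.0)
        return z > self.z_limit, confidence


detector = AdaptiveThreshold()
for latency_ms in [100, 102, 98, 101, 99, 100, 400]:
    flagged, confidence = detector.score(latency_ms)
print(flagged, confidence)
```

Because the baseline is a rolling window, the threshold follows gradual drift in the metric. This sketch also learns from anomalous values, so a sustained incident eventually becomes the new baseline; a production detector might exclude flagged samples.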

### 🔍 Diagnostician Agent (Root Cause Analysis)

Identifies patterns such as:

- DB connection pool exhaustion
- Dependency timeouts
- Resource saturation
- App-layer regressions
- Misconfigurations

### 🔮 Predictive Agent (Forecasting)

- 15-minute risk projection
- Trend analysis
- Time-to-failure estimates
- Risk levels: low → critical
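
Trend analysis and time-to-failure estimation can be illustrated with a least-squares line fitted to recent samples. This is a sketch under stated assumptions: the function names, the linear model, and the intermediate risk labels `medium` and `high` are hypothetical, and the 900-second horizon mirrors the 15-minute projection.

```python
def time_to_threshold(samples, threshold):
    """Estimate seconds from the last sample until a fitted line reaches `threshold`.

    `samples` is a list of (t_seconds, value) pairs. Returns None when the trend
    is flat or falling, or when there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    if slope * last_t + intercept >= threshold:
        return 0.0  # the fitted value is already at or past the threshold
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_t


def risk_level(seconds, horizon=900.0):
    """Map a time-to-failure estimate onto a risk label (labels are illustrative)."""
    if seconds is None or seconds > horizon:
        return "low"
    if seconds > horizon / 2:
        return "medium"
    if seconds > horizon / 6:
        return "high"
    return "critical"


cpu = [(0, 50.0), (60, 56.0), (120, 62.0), (180, 68.0)]  # percent CPU over time
eta = time_to_threshold(cpu, threshold=90.0)
print(eta, risk_level(eta))
```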

## 🚀 Quick Start

### 1. Clone

```bash
git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework
```

### 2. Create environment

```bash
python3.10 -m venv venv
source venv/bin/activate     # Windows: venv\Scripts\activate
```

### 3. Install

```bash
pip install -r requirements.txt
```

### 4. Start

```bash
python app.py
```

**UI:** http://localhost:7860

## 🛠 Configuration

Create `.env`:

```env
HF_TOKEN=your_token
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
LOG_LEVEL=INFO
HOST=0.0.0.0
PORT=7860
```

**Note:** `HF_TOKEN` is optional and used for downloading SentenceTransformer models from Hugging Face Hub.

## 🧩 Custom Healing Policies

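```python
# HealingPolicy, PolicyCondition, and HealingAction come from the framework;
# import them from the module where your checkout defines them.
custom = HealingPolicy(
    name="custom_latency",
    # Trigger when p99 latency is greater than 200 (presumably milliseconds).
    conditions=[PolicyCondition("latency_p99", "gt", 200)],
    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
    priority=1,
    cool_down_seconds=300,        # minimum gap between executions
    max_executions_per_hour=5,    # hard cap on executions per hour
)
```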
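
To show how a policy like this could be evaluated, here is a self-contained stand-in. It is not the framework's policy engine: `SketchCondition`, `SketchPolicy`, and every operator name other than `gt` are assumptions, as is the exact way the cool-down and hourly cap interact.

```python
import operator
import time
from dataclasses import dataclass, field

# Only "gt" appears in the example above; the other operator names are assumed.
OPERATORS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
    "eq": operator.eq,
}


@dataclass
class SketchCondition:
    metric: str
    op: str
    threshold: float

    def matches(self, metrics):
        value = metrics.get(self.metric)
        return value is not None and OPERATORS[self.op](value, self.threshold)


@dataclass
class SketchPolicy:
    name: str
    conditions: list
    actions: list
    cool_down_seconds: float = 300.0
    max_executions_per_hour: int = 5
    _runs: list = field(default_factory=list)  # timestamps of past executions

    def evaluate(self, metrics, now=None):
        """Return the actions to run, or [] if conditions or limits block them."""
        now = time.monotonic() if now is None else now
        self._runs = [t for t in self._runs if now - t < 3600]
        if not all(c.matches(metrics) for c in self.conditions):
            return []
        if self._runs and now - self._runs[-1] < self.cool_down_seconds:
            return []  # still cooling down from the last execution
        if len(self._runs) >= self.max_executions_per_hour:
            return []  # hourly cap reached
        self._runs.append(now)
        return list(self.actions)


policy = SketchPolicy(
    name="custom_latency",
    conditions=[SketchCondition("latency_p99", "gt", 200)],
    actions=["restart_container", "alert_team"],
)
print(policy.evaluate({"latency_p99": 350}))
```

The cool-down and hourly cap stop a flapping metric from triggering repeated restarts, which could otherwise make an incident worse.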

## 🐳 Docker Deployment

A `Dockerfile` and a `docker-compose.yml` are included.

```bash
docker-compose up -d
```

## 📈 Performance Benchmarks

**On Intel i7, 16GB RAM:**

| Component | p50 | p99 |
|-----------|-----|-----|
| Total End-to-End | ~100ms | ~250ms |
| Policy Engine | 19ms | 38ms |
| Vector Encoding | 15ms | 30ms |

**Stable memory:** ~250MB  
**Throughput:** 100+ events/sec
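
For readers who want to reproduce p50/p99 figures on their own hardware, the following sketch times a callable and summarizes the samples with the standard library. The helper names are hypothetical, and `handle_event` is a placeholder for whatever code path is being measured.

```python
import statistics
import time


def measure_latencies(func, runs=200):
    """Call `func` repeatedly and return per-call latencies in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile_summary(samples_ms):
    """Return the 50th and 99th percentiles of the given samples."""
    # n=100 yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def handle_event():
    sum(i * i for i in range(10_000))  # placeholder workload


print(percentile_summary(measure_latencies(handle_event)))
```

Reporting p99 alongside p50 matters because tail latency, not the median, is usually what users notice under burst load.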

## 🧪 Testing

### Production Dependencies

```bash
pip install -r requirements.txt
```

### Development Dependencies

```bash
pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy
```

### Run Tests

```bash
pytest tests/ -v --cov
```

**Coverage:** 87%

Includes:
- Unit tests
- Thread-safety tests
- Stress tests
- Integration tests

### Code Quality

```bash
# Format code
black .

# Lint code
ruff check .

# Type checking
mypy app.py
```

## 🗺 Roadmap

### v2.1

- Distributed FAISS
- Prometheus / Grafana
- Slack & PagerDuty integration
- Custom alerting DSL

### v3.0

- Reinforcement learning for policy optimization
- LSTM forecasting
- Dependency graph neural networks

## 🤝 Contributing

Pull requests welcome.

Please run tests before submitting.

## 📬 Contact

**Author:** Juan Petter (LGCY Labs)

- 📧 [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
- 🔗 [linkedin.com/in/petterjuan](https://linkedin.com/in/petterjuan)
- 📅 [Book a session](https://calendly.com/petter2025us/30min)

## ⭐ Support

If this project helps you:

- ⭐ Star the repo
- 🔄 Share with your network
- 🐛 Report issues
- 💡 Suggest features

<p align="center">
  <sub>Built with ❤️ for production reliability</sub>
</p>