petter2025 committed on
Commit b6a939e · verified · 1 Parent(s): 3ecd302

Upload README.md

  1. README.md +322 -0
README.md ADDED
@@ -0,0 +1,322 @@
+ <p align="center">
+ <img src="https://dummyimage.com/1200x260/000/fff&text=AGENTIC+RELIABILITY+FRAMEWORK" width="100%" alt="Agentic Reliability Framework Banner" />
+ </p>
+
+ <h1 align="center">⚙️ Agentic Reliability Framework</h1>
+
+ <p align="center">
+ <strong>Adaptive anomaly detection + policy-driven self-healing for AI systems</strong><br>
+ Minimal, fast, and production-focused.
+ </p>
+
+ <p align="center">
+ <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10+-blue" alt="Python 3.10+"></a>
+ <a href="#"><img src="https://img.shields.io/badge/status-MVP-green" alt="Status: MVP"></a>
+ <a href="#"><img src="https://img.shields.io/badge/license-MIT-lightgrey" alt="License: MIT"></a>
+ </p>
+
+ ## 🧠 Agentic Reliability Framework
+
+ **Autonomous Reliability Engineering for Production AI Systems**
+
+ Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-focused, multi-agent system that detects, diagnoses, and predicts incidents, then applies policy-driven healing actions automatically. End-to-end event processing takes roughly 100ms at p50 (see Performance Benchmarks).
+
+ ## ⭐ Key Features
+
+ - **Real-time anomaly detection** across latency, errors, throughput & resources
+ - **Root-cause analysis** with evidence correlation
+ - **Predictive forecasting** (15-minute lookahead)
+ - **Automated healing policies** (restart, rollback, scale, circuit break)
+ - **Incident memory** with FAISS for semantic recall
+ - **Security hardened** (the CVEs listed below are patched)
+ - **Thread-safe, async, process-pooled architecture**
+ - **~100ms end-to-end latency** (p50)
+
+ ## 🔐 Security Hardening (v2.0)
+
+ | CVE | Severity | Component | Status |
+ |-----|----------|-----------|--------|
+ | CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
+ | CVE-2025-48889 | 7.5 | Gradio SVG DoS | ✅ Patched |
+ | CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
+ | CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
+ | CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |
+
+ ### Additional Hardening
+
+ - SHA-256 hashing everywhere (no MD5)
+ - Pydantic v2 input validation
+ - Rate limiting (60 requests per minute per user)
+ - Atomic operations with a thread-safe, single-writer FAISS pattern
+ - Lock-free reads for high throughput
+
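The per-user rate limit above could be enforced with a sliding window. The sketch below is only an illustration under assumptions: the class name `SlidingWindowRateLimiter` and its parameters are hypothetical and are not taken from the ARF codebase.

```python
import time
from collections import defaultdict, deque
from threading import Lock


class SlidingWindowRateLimiter:
    """Hypothetical limiter: at most `limit` calls per `window_seconds` per user."""

    def __init__(self, limit=60, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls = defaultdict(deque)  # user id -> timestamps of accepted calls
        self._lock = Lock()

    def allow(self, user_id):
        now = self._clock()
        with self._lock:
            calls = self._calls[user_id]
            # Drop timestamps that have fallen out of the window.
            while calls and now - calls[0] >= self.window_seconds:
                calls.popleft()
            if len(calls) >= self.limit:
                return False
            calls.append(now)
            return True
```

A request handler would call `allow(user_id)` first and reject the request when it returns `False`.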
+ ## ⚡ Lock-Free Reads for High Throughput
+
+ The internal memory stores follow single-writer / multi-reader semantics: one writer applies all mutations, and readers never take a lock. Reads therefore do not block behind writes, which removes the tail-latency spikes caused by lock contention and keeps event flow smooth even under burst load.
+
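One way to get single-writer / multi-reader behaviour in Python is to publish an immutable snapshot that only a dedicated writer thread replaces. This is a minimal sketch, not ARF's implementation: `SnapshotStore` is a hypothetical name, and the lock-free read relies on CPython treating an attribute assignment as a single atomic reference swap.

```python
import queue
import threading


class SnapshotStore:
    """Hypothetical store: readers see an immutable snapshot, one thread writes."""

    def __init__(self):
        self._snapshot = ()            # immutable tuple, replaced wholesale by the writer
        self._pending = queue.Queue()  # every mutation is funnelled through this queue
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def read(self):
        # No lock: the caller gets either the old tuple or the new one, never a partial one.
        return self._snapshot

    def append(self, item):
        self._pending.put(item)

    def flush(self):
        # Block until the writer has applied everything queued so far.
        self._pending.join()

    def _drain(self):
        while True:
            item = self._pending.get()
            self._snapshot = self._snapshot + (item,)  # build a new tuple, then swap the reference
            self._pending.task_done()
```

Copying the tuple on every write is O(n), so this form only suits small stores; a real vector index would need a different publication strategy.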
+ ### Performance Impact
+
+ | Metric | Before | After | Δ |
+ |--------|--------|-------|---|
+ | Event Processing (p50) | ~350ms | ~100ms | ⚡ ~71% lower |
+ | Event Processing (p99) | ~800ms | ~250ms | ⚡ ~69% lower |
+ | Agent Orchestration | Sequential | Parallel | 3× throughput |
+ | Memory Behavior | Growing | Stable / Bounded | 0 leaks |
+
+ ## 🧩 Architecture Overview
+
+ ### System Flow
+
+ ```
+ Your Production System
+ (APIs, Databases, Microservices)
+ ↓
+ Agentic Reliability Core
+ Detect → Diagnose → Predict
+ ↓
+ Agents:
+   🕵️ Detective Agent – Anomaly detection
+   🔍 Diagnostician Agent – Root cause analysis
+   🔮 Predictive Agent – Forecasting / risk estimation
+ ↓
+ Policy Engine (Auto-Healing)
+ ↓
+ Healing Actions:
+   • Restart
+   • Scale
+   • Rollback
+   • Circuit-break
+ ```
+
+ ## 🏗️ Core Framework Components
+
+ ### Web Framework & UI
+
+ - **Gradio 5.50+** - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860)
+ - **Python 3.10+** - Core implementation with asynchronous, thread-safe architecture
+
+ ### AI/ML Stack
+
+ - **FAISS-CPU 1.13.0** - Facebook AI Similarity Search for persistent incident memory and vector operations
+ - **SentenceTransformers 5.1.1** - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
+ - **NumPy 1.26.4** - Numerical computing foundation for vector operations and data processing
+
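To illustrate what semantic recall over incident memory means, here is a dependency-free stand-in. It deliberately swaps FAISS for brute-force cosine similarity and uses hand-written vectors in place of SentenceTransformers embeddings; `IncidentMemory` and its methods are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IncidentMemory:
    """Hypothetical in-memory store; a real deployment would use a vector index."""

    def __init__(self):
        self._items = []  # list of (vector, description) pairs

    def add(self, vector, description):
        self._items.append((list(vector), description))

    def recall(self, query, k=1):
        # Rank every stored incident by similarity to the query vector.
        ranked = sorted(self._items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [description for _, description in ranked[:k]]


memory = IncidentMemory()
memory.add([1.0, 0.0, 0.0], "db connection pool exhaustion")
memory.add([0.0, 1.0, 0.0], "dependency timeout")
closest = memory.recall([0.9, 0.1, 0.0])
```

The linear scan is O(n) per query, which is the cost a dedicated similarity-search library is meant to avoid at scale.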
+ ### Data & HTTP Layer
+
+ - **Pydantic 2.11+** - Type-safe data modeling with frozen models for immutability and runtime validation
+ - **Requests 2.32.5** - HTTP client library for external API communication (security patched)
+
+ ### Reliability & Resilience
+
+ - **CircuitBreaker 2.0+** - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
+ - **AtomicWrites 1.4.1** - Atomic file operations ensuring data consistency and durability
+
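The circuit breaker pattern named above can be sketched in a few lines. This is a hand-rolled illustration, not the API of the CircuitBreaker package; `SimpleCircuitBreaker`, `CircuitOpenError`, and the parameter names are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling more requests onto a dependency that is already struggling, which is how the pattern limits cascading failures.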
+ ## 🎯 Architecture Pattern
+
+ ARF implements a **Multi-Agent Orchestration Pattern** with three specialized agents:
+
+ - **Detective Agent** - Anomaly detection
+ - **Diagnostician Agent** - Root cause analysis
+ - **Predictive Agent** - Future risk forecasting
+
+ All three agents run in **parallel** rather than sequentially, for a **3× throughput improvement**.
+
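Running the three agents in parallel can be pictured with `asyncio.gather`. The coroutines below are placeholders with made-up return values, assuming each agent is I/O-bound; they are not ARF's agent classes.

```python
import asyncio


async def detective(event):
    await asyncio.sleep(0.01)  # stands in for real analysis work
    return {"agent": "detective", "anomaly": event["latency_p99"] > 200}


async def diagnostician(event):
    await asyncio.sleep(0.01)
    return {"agent": "diagnostician", "root_cause": "unknown"}


async def predictive(event):
    await asyncio.sleep(0.01)
    return {"agent": "predictive", "risk": "low"}


async def analyze(event):
    # gather() runs the three coroutines concurrently and keeps results in argument order.
    return await asyncio.gather(detective(event), diagnostician(event), predictive(event))


results = asyncio.run(analyze({"latency_p99": 350}))
```

With this shape, the total wait is close to the slowest agent rather than the sum of all three. CPU-bound agents would need threads or a process pool instead, since coroutines share one thread.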
+ ### ⚡ Performance Features
+
+ - Native async handlers (no event loop overhead)
+ - Thread-safe single-writer/multi-reader pattern for FAISS
+ - RLock-protected policy evaluation
+ - Queue-based writes to prevent race conditions
+ - ~100ms p50 latency at 100+ events/second
+
+ The framework combines **Gradio** for the web/UI layer, **FAISS** for vector memory, and **SentenceTransformers** for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.
+
+ ## 🧪 The Three Agents
+
+ ### 🕵️ Detective Agent (Anomaly Detection)
+
+ Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade.
+
+ - Adaptive multi-metric scoring
+ - CPU/memory resource anomaly detection
+ - Latency & error spike detection
+ - Confidence scoring (0–1)
+
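An adaptive threshold can be as simple as a z-score against a rolling window of recent values. The sketch below covers only that idea for a single metric, with no embeddings; `AdaptiveThresholdDetector`, the window size, and the confidence formula are assumptions for illustration.

```python
import math
from collections import deque


class AdaptiveThresholdDetector:
    """Hypothetical detector: flags values far from the rolling mean."""

    def __init__(self, window=50, z_threshold=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # rolling baseline of recent observations
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return (is_anomaly, confidence in [0, 1]), then record the value."""
        is_anomaly, confidence = False, 0.0
        if len(self.values) >= self.min_samples:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var) or 1e-9  # avoid division by zero on flat data
            z = abs(value - mean) / std
            is_anomaly = z > self.z_threshold
            # Scale the z-score into [0, 1]; saturates at twice the threshold.
            confidence = min(z / (2 * self.z_threshold), 1.0)
        self.values.append(value)
        return is_anomaly, confidence
```

Because the baseline moves with the data, the threshold tightens on quiet services and loosens on noisy ones. One known weakness of this simple form is that anomalous values also enter the window and widen it.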
+ ### 🔍 Diagnostician Agent (Root Cause Analysis)
+
+ Identifies patterns such as:
+
+ - DB connection pool exhaustion
+ - Dependency timeouts
+ - Resource saturation
+ - App-layer regressions
+ - Misconfigurations
+
+ ### 🔮 Predictive Agent (Forecasting)
+
+ - 15-minute risk projection
+ - Trend analysis
+ - Time-to-failure estimates
+ - Risk levels: low → critical
+
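A 15-minute projection and a time-to-failure estimate can both be read off a least-squares trend line. This sketch assumes a simple linear trend, which is not necessarily what ARF uses. The function names, the `limit` parameter, the intermediate risk labels ("medium", "high"), and the cut-offs are all hypothetical.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for (minute, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)  # needs at least two distinct times
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    return slope, mean_v - slope * mean_t


def forecast(samples, limit, horizon_minutes=15):
    """Return (projected value, minutes until `limit` is reached, risk level)."""
    slope, intercept = linear_trend(samples)
    last_t = samples[-1][0]
    projected = slope * (last_t + horizon_minutes) + intercept
    # Time until the fitted line crosses `limit`; None if the metric is not rising.
    minutes_to_limit = None
    if slope > 0:
        minutes_to_limit = max((limit - intercept) / slope - last_t, 0.0)
    if minutes_to_limit is None or minutes_to_limit > 4 * horizon_minutes:
        risk = "low"
    elif minutes_to_limit > horizon_minutes:
        risk = "medium"
    elif minutes_to_limit > horizon_minutes / 3:
        risk = "high"
    else:
        risk = "critical"
    return projected, minutes_to_limit, risk
```

For example, memory usage rising two points per minute from 50% gives a projection of 88% fifteen minutes after the last sample and reaches a 90% limit sixteen minutes after it.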
+ ## 🚀 Quick Start
+
+ ### 1. Clone
+
+ ```bash
+ git clone https://github.com/petterjuan/agentic-reliability-framework.git
+ cd agentic-reliability-framework
+ ```
+
+ ### 2. Create environment
+
+ ```bash
+ python3.10 -m venv venv
+ source venv/bin/activate # Windows: venv\Scripts\activate
+ ```
+
+ ### 3. Install
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### 4. Start
+
+ ```bash
+ python app.py
+ ```
+
+ **UI:** http://localhost:7860
+
+ ## 🛠 Configuration
+
+ Create a `.env` file:
+
+ ```env
+ HF_TOKEN=your_token
+ DATA_DIR=./data
+ INDEX_FILE=data/incident_vectors.index
+ LOG_LEVEL=INFO
+ HOST=0.0.0.0
+ PORT=7860
+ ```
+
+ **Note:** `HF_TOKEN` is optional and used for downloading SentenceTransformer models from Hugging Face Hub.
+
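For readers wiring these variables into their own code, the sketch below shows one way to read them with typed defaults. It assumes the variables are already in the process environment (for example, exported by the shell or loaded by a tool such as python-dotenv). `Settings` and `load_settings` are hypothetical names, not part of ARF, and the defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    data_dir: str
    index_file: str
    log_level: str
    host: str
    port: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # optional; empty string counts as unset
        data_dir=env.get("DATA_DIR", "./data"),
        index_file=env.get("INDEX_FILE", "data/incident_vectors.index"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),  # raises ValueError on a non-numeric port
    )
```

Parsing `PORT` once at start-up surfaces a malformed value immediately rather than when the server tries to bind.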
+ ## 🧩 Custom Healing Policies
+
+ ```python
+ # HealingPolicy, PolicyCondition, and HealingAction are the framework's own classes;
+ # import them from the module that defines them in your checkout.
+ custom = HealingPolicy(
+     name="custom_latency",
+     # Trigger when the latency_p99 metric is greater than 200.
+     conditions=[PolicyCondition("latency_p99", "gt", 200)],
+     actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
+     priority=1,
+     cool_down_seconds=300,
+     max_executions_per_hour=5,
+ )
+ ```
+
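The `cool_down_seconds` and `max_executions_per_hour` fields suggest that a policy is throttled even when its conditions match. The sketch below shows one plausible reading of that behaviour. `Condition` and `Policy` are simplified, hypothetical stand-ins for `PolicyCondition` and `HealingPolicy`, and the operator names other than `"gt"` are assumptions.

```python
import operator
import time
from dataclasses import dataclass, field

OPERATORS = {"gt": operator.gt, "lt": operator.lt, "gte": operator.ge, "lte": operator.le}


@dataclass
class Condition:  # hypothetical stand-in for PolicyCondition
    metric: str
    op: str
    threshold: float


@dataclass
class Policy:  # hypothetical stand-in for HealingPolicy
    name: str
    conditions: list
    actions: list
    cool_down_seconds: float = 300
    max_executions_per_hour: int = 5
    _runs: list = field(default_factory=list)  # timestamps of past executions

    def evaluate(self, metrics, now=None):
        """Return the actions to run, or [] if conditions fail or a limit applies."""
        now = time.time() if now is None else now
        matched = all(
            c.metric in metrics and OPERATORS[c.op](metrics[c.metric], c.threshold)
            for c in self.conditions
        )
        if not matched:
            return []
        self._runs = [t for t in self._runs if now - t < 3600]  # keep the last hour only
        if self._runs and now - self._runs[-1] < self.cool_down_seconds:
            return []  # still cooling down from the previous execution
        if len(self._runs) >= self.max_executions_per_hour:
            return []  # hourly budget exhausted
        self._runs.append(now)
        return list(self.actions)
```

The two limits guard against different failure modes: the cool-down stops a flapping metric from triggering back-to-back restarts, and the hourly cap bounds the total automated intervention when the underlying problem persists.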
+ ## 🐳 Docker Deployment
+
+ A `Dockerfile` and `docker-compose.yml` are included.
+
+ ```bash
+ docker-compose up -d
+ ```
+
+ ## 📈 Performance Benchmarks
+
+ **On Intel i7, 16GB RAM:**
+
+ | Component | p50 | p99 |
+ |-----------|-----|-----|
+ | Total End-to-End | ~100ms | ~250ms |
+ | Policy Engine | 19ms | 38ms |
+ | Vector Encoding | 15ms | 30ms |
+
+ - **Stable memory:** ~250MB
+ - **Throughput:** 100+ events/sec
+
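To reproduce p50 and p99 figures like those in the table on your own hardware, latencies can be sampled and summarized with the standard library. This is a generic measurement sketch, not the benchmark harness behind the numbers above; `measure` and `summarize` are hypothetical helpers.

```python
import statistics
import time


def measure(func, runs=200):
    """Call `func` repeatedly and return each call's latency in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return the 50th and 99th percentile of a list of latencies."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}
```

A p99 estimate is only meaningful with a few hundred samples or more, since it is determined by the slowest one percent of calls.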
+ ## 🧪 Testing
+
+ ### Production Dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Development Dependencies
+
+ ```bash
+ pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy
+ ```
+
+ ### Run Tests
+
+ ```bash
+ pytest tests/ -v --cov
+ ```
+
+ **Coverage:** 87%
+
+ Includes:
+ - Unit tests
+ - Thread-safety tests
+ - Stress tests
+ - Integration tests
+
+ ### Code Quality
+
+ ```bash
+ # Format code
+ black .
+
+ # Lint code
+ ruff check .
+
+ # Type checking
+ mypy app.py
+ ```
+
+ ## 🗺 Roadmap
+
+ ### v2.1
+
+ - Distributed FAISS
+ - Prometheus / Grafana
+ - Slack & PagerDuty integration
+ - Custom alerting DSL
+
+ ### v3.0
+
+ - Reinforcement learning for policy optimization
+ - LSTM forecasting
+ - Dependency graph neural networks
+
+ ## 🤝 Contributing
+
+ Pull requests welcome.
+
+ Please run tests before submitting.
+
+ ## 📬 Contact
+
+ **Author:** Juan Petter (LGCY Labs)
+
+ - 📧 [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
+ - 🔗 [linkedin.com/in/petterjuan](https://linkedin.com/in/petterjuan)
+ - 📅 [Book a session](https://calendly.com/petter2025us/30min)
+
+ ## ⭐ Support
+
+ If this project helps you:
+
+ - ⭐ Star the repo
+ - 🔄 Share with your network
+ - 🐛 Report issues
+ - 💡 Suggest features
+
+ <p align="center">
+ <sub>Built with ❤️ for production reliability</sub>
+ </p>