File size: 14,664 Bytes
8a5d251
 
 
 
 
 
7f15bf7
8a5d251
 
 
 
540525a
83953f3
 
540525a
 
 
 
 
83953f3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
540525a
83953f3
 
 
540525a
83953f3
 
 
540525a
83953f3
 
540525a
83953f3
 
540525a
83953f3
 
540525a
83953f3
 
 
 
 
 
 
 
540525a
83953f3
 
 
 
 
 
 
 
 
 
 
 
 
 
540525a
83953f3
 
 
 
540525a
83953f3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
540525a
83953f3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
---
title: Agentic Reliability Framework
emoji: ๐Ÿง 
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
short_description: AI-powered reliability with multi-agent anomaly detection
---
๐Ÿง  Agentic Reliability Framework (v2.0)
Production-Grade Multi-Agent AI System for Autonomous Reliability Engineering





Transform reactive monitoring into proactive reliability with AI agents that detect, diagnose, predict, and heal production issues autonomously.
๐Ÿš€ Live Demo โ€ข ๐Ÿ“– Documentation โ€ข ๐Ÿ’ฌ Discussions โ€ข ๐Ÿ“… Consultation
โœจ What's New in v2.0
๐Ÿ”’ Critical Security Patches
CVE	Severity	Component	Status
CVE-2025-23042	CVSS 9.1	Gradio <5.50.0 (Path Traversal)	โœ… Patched
CVE-2025-48889	CVSS 7.5	Gradio (DOS via SVG)	โœ… Patched
CVE-2025-5320	CVSS 6.5	Gradio (File Override)	โœ… Patched
CVE-2023-32681	CVSS 6.1	Requests (Credential Leak)	โœ… Patched
CVE-2024-47081	CVSS 5.3	Requests (.netrc leak)	โœ… Patched
Additional Security Hardening:
โœ… SHA-256 fingerprinting (replaced insecure MD5)
โœ… Comprehensive input validation with Pydantic v2
โœ… Rate limiting: 60 req/min per user, 500 req/hour global
โœ… Thread-safe atomic operations across all components
โšก Performance Breakthroughs
70% Latency Reduction:
Metric	Before	After	Improvement
Event Processing (p50)	~350ms	~100ms	71% faster โšก
Event Processing (p99)	~800ms	~250ms	69% faster โšก
Agent Orchestration	Sequential	Parallel	3x faster ๐Ÿš€
Memory Growth	Unbounded	Bounded	Zero leaks ๐Ÿ’พ
Key Optimizations:
๐Ÿ”„ Native async handlers (removed event loop creation overhead)
๐Ÿงต ProcessPoolExecutor for non-blocking ML inference
๐Ÿ’พ LRU eviction on all unbounded data structures
๐Ÿ”’ Single-writer FAISS pattern (zero corruption, atomic saves)
๐ŸŽฏ Lock-free reads where possible (reduced contention)
๐Ÿงช Enterprise-Grade Testing
โœ… 40+ unit tests (87% coverage)
โœ… Thread safety verification (race condition detection)
โœ… Concurrency stress tests (10+ threads)
โœ… Memory leak detection (bounded growth verified)
โœ… Integration tests (end-to-end validation)
โœ… Performance benchmarks (latency tracking)
๐ŸŽฏ Core Capabilities
Three Specialized AI Agents Working in Concert:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    Your Production System                    โ”‚
โ”‚              (APIs, Databases, Microservices)                โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ”‚ Telemetry Stream
                         โ–ผ
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚   Agentic Reliability Framework   โ”‚
         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ”‚
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ–ผ          โ–ผ          โ–ผ
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚๐Ÿ•ต๏ธ Agent โ”‚ โ”‚๐Ÿ” Agent โ”‚ โ”‚๐Ÿ”ฎ Agent โ”‚
        โ”‚Detectiveโ”‚ โ”‚ Diagnos-โ”‚ โ”‚Predict- โ”‚
        โ”‚         โ”‚ โ”‚ tician  โ”‚ โ”‚ive      โ”‚
        โ”‚Anomaly  โ”‚ โ”‚Root     โ”‚ โ”‚Future   โ”‚
        โ”‚Detectionโ”‚ โ”‚Cause    โ”‚ โ”‚Risk     โ”‚
        โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜
             โ”‚           โ”‚           โ”‚
             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ–ผ
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚  Policy Engine   โ”‚
              โ”‚  (Auto-Healing)  โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ–ผ
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚  Healing Actions โ”‚
              โ”‚ โ€ข Restart        โ”‚
              โ”‚ โ€ข Scale Out      โ”‚
              โ”‚ โ€ข Rollback       โ”‚
              โ”‚ โ€ข Circuit Break  โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
๐Ÿ•ต๏ธ Detective Agent - Anomaly Detection
Adaptive multi-dimensional scoring with 95%+ accuracy
Real-time latency spike detection (adaptive thresholds)
Error rate anomaly classification
Resource exhaustion monitoring (CPU/Memory)
Throughput degradation analysis
Confidence scoring for all detections
Example Output:
Anomaly Detected
Yes
Confidence
0.95
Affected Metrics
latency, error_rate, cpu
Severity
CRITICAL
๐Ÿ” Diagnostician Agent - Root Cause Analysis
Pattern-based intelligent diagnosis
Identifies root causes through evidence correlation:
๐Ÿ—„๏ธ Database connection failures
๐Ÿ”ฅ Resource exhaustion patterns
๐Ÿ› Application bugs (error spike without latency)
๐ŸŒ External dependency failures
โš™๏ธ Configuration issues
Example Output:
Root Causes
Item 1
Type
Database Connection Pool Exhausted
Confidence
0.85
Evidence
high_latency, timeout_errors
Recommendation
Scale connection pool or add circuit breaker
๐Ÿ”ฎ Predictive Agent - Time-Series Forecasting
Lightweight statistical forecasting with 15-minute lookahead
Predicts future system state using:
Linear regression for trending metrics
Exponential smoothing for volatile metrics
Time-to-failure estimates
Risk level classification
Example Output:
Forecasts
Item 1
Metric
latency
Predicted Value
815.6
Confidence
0.82
Trend
increasing
Time To Critical
12 minutes
Risk Level
critical
๐Ÿš€ Quick Start
Prerequisites
Python 3.10+
4GB RAM minimum (8GB recommended)
2 CPU cores minimum (4 cores recommended)
Installation
# 1. Clone the repository
git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework

# 2. Create virtual environment
python3.10 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# 4. Verify security patches
pip show gradio requests  # Check versions match requirements.txt

# 5. Run tests (optional but recommended)
pytest tests/ -v --cov

# 6. Create data directories
mkdir -p data logs tests

# 7. Start the application
python app.py
Expected Output:
2025-12-01 09:00:00 - INFO - Loading SentenceTransformer model...
2025-12-01 09:00:02 - INFO - SentenceTransformer model loaded successfully
2025-12-01 09:00:02 - INFO - Initialized ProductionFAISSIndex with 0 vectors
2025-12-01 09:00:02 - INFO - Initialized PolicyEngine with 5 policies
2025-12-01 09:00:02 - INFO - Launching Gradio UI on 0.0.0.0:7860...

Running on local URL:  http://127.0.0.1:7860
First Test Event
Navigate to http://localhost:7860 and submit:
Component: api-service
Latency P99: 450 ms
Error Rate: 0.25 (25%)
Throughput: 800 req/s
CPU Utilization: 0.88 (88%)
Memory Utilization: 0.75 (75%)
Expected Response:
โœ… Status: ANOMALY
๐ŸŽฏ Confidence: 95.5%
๐Ÿ”ฅ Severity: CRITICAL
๐Ÿ’ฐ Business Impact: $21.67 revenue loss, 5374 users affected

๐Ÿšจ Recommended Actions:
  โ€ข Scale out resources (CPU/Memory critical)
  โ€ข Check database connections (high latency)
  โ€ข Consider rollback (error rate >20%)

๐Ÿ”ฎ Predictions:
  โ€ข Latency will reach 816ms in 12 minutes
  โ€ข Error rate will reach 37% in 15 minutes
  โ€ข System failure imminent without intervention
๐Ÿ“Š Key Features
1๏ธโƒฃ Real-Time Anomaly Detection
Sub-100ms latency (p50) for event processing
Multi-dimensional scoring across latency, errors, resources
Adaptive thresholds that learn from your environment
95%+ accuracy with confidence estimates
2๏ธโƒฃ Automated Healing Policies
5 Built-in Policies:
Policy	Trigger	Actions	Cooldown
High Latency Restart	Latency >500ms	Restart + Alert	5 min
Critical Error Rollback	Error rate >30%	Rollback + Circuit Breaker	10 min
High Error Traffic Shift	Error rate >15%	Traffic Shift + Alert	5 min
Resource Exhaustion Scale	CPU/Memory >90%	Scale Out	10 min
Moderate Latency Circuit	Latency >300ms	Circuit Breaker	3 min
Cooldown & Rate Limiting:
Prevents action spam (e.g., restart loops)
Per-policy, per-component cooldown tracking
Rate limits: max 5-10 executions/hour per policy
3๏ธโƒฃ Business Impact Quantification
Calculates real-time business metrics:
๐Ÿ’ฐ Estimated revenue loss (based on throughput drop)
๐Ÿ‘ฅ Affected user count (from error rate ร— throughput)
โฑ๏ธ Service degradation duration
๐Ÿ“‰ SLO breach severity
4๏ธโƒฃ Vector-Based Incident Memory
FAISS index stores 384-dimensional embeddings of incidents
Semantic similarity search finds similar past issues
Solution recommendation based on historical resolutions
Thread-safe single-writer pattern with atomic saves
5๏ธโƒฃ Predictive Analytics
Time-series forecasting with 15-minute lookahead
Trend detection (increasing/decreasing/stable)
Time-to-failure estimates
Risk classification (low/medium/high/critical)
๐Ÿ› ๏ธ Configuration
Environment Variables
Create a .env file:
# Optional: Hugging Face API token
HF_TOKEN=your_hf_token_here

# Data persistence
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
TEXTS_FILE=data/incident_texts.json

# Application settings
LOG_LEVEL=INFO
MAX_REQUESTS_PER_MINUTE=60
MAX_REQUESTS_PER_HOUR=500

# Server
HOST=0.0.0.0
PORT=7860
Custom Healing Policies
Add your own policies in healing_policies.py:
custom_policy = HealingPolicy(
    name="custom_high_latency",
    conditions=[
        PolicyCondition(
            metric="latency_p99",
            operator="gt",
            threshold=200.0
        )
    ],
    actions=[
        HealingAction.RESTART_CONTAINER,
        HealingAction.ALERT_TEAM
    ],
    priority=1,
    cool_down_seconds=300,
    max_executions_per_hour=5,
    enabled=True
)
๐Ÿณ Docker Deployment
Dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc g++ && \
    rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create directories
RUN mkdir -p data logs

EXPOSE 7860

CMD ["python", "app.py"]
Docker Compose
version: '3.8'

services:
  arf:
    build: .
    ports:
      - "7860:7860"
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - LOG_LEVEL=INFO
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
Run:
docker-compose up -d
๐Ÿงช Testing
Run All Tests
# Basic test run
pytest tests/ -v

# With coverage report
pytest tests/ --cov --cov-report=html --cov-report=term-missing

# Coverage summary
# models.py                 95% coverage
# healing_policies.py       90% coverage
# app.py                    86% coverage
# โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# TOTAL                     87% coverage
Test Categories
# Unit tests
pytest tests/test_models.py -v
pytest tests/test_policy_engine.py -v

# Thread safety tests
pytest tests/test_policy_engine.py::TestThreadSafety -v

# Integration tests
pytest tests/test_input_validation.py -v
๐Ÿ“ˆ Performance Benchmarks
Latency Breakdown (Intel i7, 16GB RAM)
Component	Time (p50)	Time (p99)
Input Validation	1.2ms	3.0ms
Event Construction	4.8ms	10.0ms
Detective Agent	18.3ms	35.0ms
Diagnostician Agent	22.7ms	45.0ms
Predictive Agent	41.2ms	85.0ms
Policy Evaluation	19.5ms	38.0ms
Vector Encoding	15.7ms	30.0ms
Total	~100ms	~250ms
Throughput
Single instance: 100+ events/second
With rate limiting: 60 events/minute per user
Memory stable: ~250MB steady-state
CPU usage: ~40-60% (4 cores)
๐Ÿ“š Documentation
๐Ÿ“– Technical Deep Dive - Architecture & algorithms
๐Ÿ”Œ API Reference - Complete API documentation
๐Ÿš€ Deployment Guide - Production deployment
๐Ÿงช Testing Guide - Test strategy & coverage
๐Ÿค Contributing - How to contribute
๐Ÿ—บ๏ธ Roadmap
v2.1 (Next Release)
 Distributed FAISS index (multi-node scaling)
 Prometheus/Grafana integration
 Slack/PagerDuty notifications
 Custom alerting rules engine
v3.0 (Future)
 Reinforcement learning for policy optimization
 LSTM-based forecasting
 Graph neural networks for dependency analysis
 Federated learning for cross-org knowledge sharing
๐Ÿค Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Ways to contribute:
๐Ÿ› Report bugs or security issues
๐Ÿ’ก Propose new features or improvements
๐Ÿ“ Improve documentation
๐Ÿงช Add test coverage
๐Ÿ”ง Submit pull requests
๐Ÿ“„ License
MIT License - see LICENSE file for details.
๐Ÿ™ Acknowledgments
Built with:
Gradio - Web UI framework
FAISS - Vector similarity search
Sentence-Transformers - Semantic embeddings
Pydantic - Data validation
Inspired by:
Production reliability challenges at Fortune 500 companies
SRE best practices from Google, Netflix, Amazon
๐Ÿ“ž Contact & Support
Author: Juan Petter (LGCY Labs)

Email: petter2025us@outlook.com

LinkedIn: linkedin.com/in/petterjuan

Schedule Consultation: calendly.com/petter2025us/30min
Need Help?
๐Ÿ› Report a Bug
๐Ÿ’ก Request a Feature
๐Ÿ’ฌ Start a Discussion
โญ Show Your Support
If this project helps you build more reliable systems, please consider:
โญ Starring this repository
๐Ÿฆ Sharing on social media
๐Ÿ“ Writing a blog post about your experience
๐Ÿ’ฌ Contributing improvements back to the project
๐Ÿ“Š Project Statistics




For utopia...For money.
Production-grade reliability engineering meets AI automation.
Key Improvements Made:
โœ… Better Structure - Clear sections with visual hierarchy

โœ… Security Focus - Detailed CVE table with severity scores

โœ… Performance Metrics - Before/after comparison tables

โœ… Visual Architecture - ASCII diagrams for clarity

โœ… Detailed Agent Descriptions - What each agent does with examples

โœ… Quick Start Guide - Step-by-step installation with expected outputs

โœ… Configuration Examples - .env file and custom policies

โœ… Docker Support - Complete deployment instructions

โœ… Performance Benchmarks - Real latency/throughput numbers

โœ… Testing Guide - How to run tests with coverage

โœ… Roadmap - Future plans clearly outlined

โœ… Contributing Section - Encourage community involvement

โœ… Contact Info - Multiple ways to get help