Agentic Reliability Framework
Adaptive anomaly detection + policy-driven self-healing for AI systems
Minimal, fast, and production-focused.
Autonomous Reliability Engineering for Production AI Systems
Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that automatically detects, diagnoses, predicts, and resolves incidents, with a median (p50) end-to-end processing latency of about 100ms.
⭐ Key Features
- Real-time anomaly detection across latency, errors, throughput & resources
- Root-cause analysis with evidence correlation
- Predictive forecasting (15-minute lookahead)
- Automated healing policies (restart, rollback, scale, circuit break)
- Incident memory with FAISS for semantic recall
- Security hardened (known CVEs in the Gradio and Requests dependencies patched; see Security Hardening)
- Thread-safe, async, process-pooled architecture
- ~100ms end-to-end latency at p50 (~250ms at p99)
Security Hardening (v2.0)
| CVE | CVSS Score | Component | Status |
|---|---|---|---|
| CVE-2025-23042 | 9.1 | Gradio Path Traversal | Patched |
| CVE-2025-48889 | 7.5 | Gradio SVG DoS | Patched |
| CVE-2025-5320 | 6.5 | Gradio File Override | Patched |
| CVE-2023-32681 | 6.1 | Requests Credential Leak | Patched |
| CVE-2024-47081 | 5.3 | Requests .netrc Leak | Patched |
Additional Hardening
- SHA-256 hashing everywhere (no MD5)
- Pydantic v2 input validation
- Rate limiting (60 req/min/user)
- Atomic operations w/ thread-safe FAISS single-writer pattern
- Lock-free reads for high throughput
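The per-user rate limit listed above (60 requests per minute) could be enforced with a sliding window. The following is a minimal sketch using only the standard library; the class name `SlidingWindowRateLimiter` and its parameters are hypothetical and not taken from the ARF codebase.

```python
import threading
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window_seconds` for each user (illustrative only)."""

    def __init__(self, limit: int = 60, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests
        self._lock = threading.Lock()    # guards _hits when called from several threads

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        with self._lock:
            hits = self._hits[user_id]
            # Drop timestamps that have aged out of the window.
            while hits and now - hits[0] >= self.window_seconds:
                hits.popleft()
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True
```

Passing `now` explicitly makes the limiter easy to test without sleeping; in normal use it falls back to `time.monotonic()`.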
⚡ Lock-Free Reads for High Throughput
By restructuring the internal memory stores around lock-free, single-writer / multi-reader semantics, the framework delivers deterministic concurrency without blocking. This removes tail-latency spikes and keeps event flows smooth even under burst load.
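One way to picture single-writer / multi-reader semantics is a store whose readers only ever see an immutable snapshot, while a single background thread applies queued writes. The sketch below is an illustration under that assumption, not ARF's actual memory store; `SnapshotStore` and its methods are hypothetical names, and it relies on CPython rebinding an attribute reference atomically.

```python
import queue
import threading


class SnapshotStore:
    """Readers get an immutable tuple; one writer thread applies queued updates."""

    def __init__(self):
        self._snapshot = ()              # replaced wholesale, never mutated in place
        self._pending = queue.Queue()    # writes are serialized through this queue
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def read(self) -> tuple:
        # No lock needed: readers see either the old or the new tuple, never a mix.
        return self._snapshot

    def submit(self, item) -> None:
        # Called from any thread; only the writer thread touches _snapshot.
        self._pending.put(item)

    def flush(self) -> None:
        # Block until every queued write has been applied.
        self._pending.join()

    def _drain(self) -> None:
        while True:
            item = self._pending.get()
            self._snapshot = self._snapshot + (item,)  # build new tuple, then swap
            self._pending.task_done()
```

Copying the tuple on every write is O(n), so this shape suits read-heavy workloads; a production store would batch writes or use a structure with cheaper updates.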
Performance Impact
| Metric | Before | After | Δ |
|---|---|---|---|
| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
| Agent Orchestration | Sequential | Parallel | 3× throughput |
| Memory Behavior | Growing | Stable / Bounded | 0 leaks |
Architecture Overview
System Flow
Your Production System
(APIs, Databases, Microservices)
↓
Agentic Reliability Core
Detect → Diagnose → Predict
↓
Agents:
Detective Agent - Anomaly detection
Diagnostician Agent - Root cause analysis
Predictive Agent - Forecasting / risk estimation
↓
Policy Engine (Auto-Healing)
↓
Healing Actions:
• Restart
• Scale
• Rollback
• Circuit-break
Core Framework Components
Web Framework & UI
- Gradio 5.50+ - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860)
- Python 3.10+ - Core implementation with asynchronous, thread-safe architecture
AI/ML Stack
- FAISS-CPU 1.13.0 - Facebook AI Similarity Search for persistent incident memory and vector operations
- SentenceTransformers 5.1.1 - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
- NumPy 1.26.4 - Numerical computing foundation for vector operations and data processing
Data & HTTP Layer
- Pydantic 2.11+ - Type-safe data modeling with frozen models for immutability and runtime validation
- Requests 2.32.5 - HTTP client library for external API communication (security patched)
Reliability & Resilience
- CircuitBreaker 2.0+ - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
- AtomicWrites 1.4.1 - Atomic file operations ensuring data consistency and durability
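To illustrate the circuit breaker pattern named above, here is a minimal standard-library sketch. It does not use or reproduce the API of the CircuitBreaker package; `SimpleCircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical choices made for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling more load onto a struggling dependency, which is what limits cascading failures.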
Architecture Pattern
ARF implements a Multi-Agent Orchestration Pattern with three specialized agents:
- Detective Agent - Anomaly detection
- Diagnostician Agent - Root cause analysis
- Predictive Agent - Future risk forecasting
All agents run in parallel (not sequentially) for a 3× throughput improvement.
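Running the three agents concurrently can be sketched with `asyncio.gather`. The agent coroutines below are placeholders: their names mirror the agents described here, but their bodies, the `latency_ms` field, and the returned dictionaries are assumptions made for illustration.

```python
import asyncio


async def detective(event: dict) -> dict:
    await asyncio.sleep(0.01)  # stands in for real anomaly-detection work
    return {"agent": "detective", "anomaly": event["latency_ms"] > 200}


async def diagnostician(event: dict) -> dict:
    await asyncio.sleep(0.01)  # stands in for root-cause analysis
    return {"agent": "diagnostician", "root_cause": "unknown"}


async def predictive(event: dict) -> dict:
    await asyncio.sleep(0.01)  # stands in for forecasting
    return {"agent": "predictive", "risk": "low"}


async def analyze(event: dict) -> list:
    # gather() schedules all three coroutines at once and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(
        detective(event), diagnostician(event), predictive(event)
    )


if __name__ == "__main__":
    print(asyncio.run(analyze({"latency_ms": 350})))
```

Because the agents wait concurrently, the total time is roughly that of the slowest agent rather than the sum of all three.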
⚡ Performance Features
- Native async handlers (no event loop overhead)
- Thread-safe single-writer/multi-reader pattern for FAISS
- RLock-protected policy evaluation
- Queue-based writes to prevent race conditions
- ~100ms p50 latency at 100+ events/second
The framework combines Gradio for the web/UI layer, FAISS for vector memory, and SentenceTransformers for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.
The Three Agents
Detective Agent (Anomaly Detection)
Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade.
- Adaptive multi-metric scoring
- CPU/mem resource anomaly detection
- Latency & error spike detection
- Confidence scoring (0–1)
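An adaptive threshold with a 0–1 confidence score can be sketched as a rolling z-score. This is one plausible approach, not necessarily the one the Detective Agent uses; `AdaptiveThreshold`, the window size, and the mapping from z-score to confidence are all assumptions.

```python
import statistics
from collections import deque


class AdaptiveThreshold:
    """Score each value against a rolling baseline of recent values."""

    def __init__(self, window: int = 100, z_limit: float = 3.0, min_samples: int = 10):
        self._values = deque(maxlen=window)  # the baseline adapts as old values fall out
        self.z_limit = z_limit
        self.min_samples = min_samples

    def score(self, value: float) -> float:
        """Return a confidence in [0, 1] that `value` is anomalous, then record it."""
        confidence = 0.0
        if len(self._values) >= self.min_samples:
            mean = statistics.fmean(self._values)
            stdev = statistics.pstdev(self._values)
            if stdev > 0:
                z = abs(value - mean) / stdev
                # Assumed mapping: z == z_limit gives 0.5, z >= 2 * z_limit gives 1.0.
                confidence = min(z / (2 * self.z_limit), 1.0)
            elif value != mean:
                confidence = 1.0  # any change from a perfectly flat baseline
        self._values.append(value)
        return confidence
```

A caller might treat `confidence >= 0.5` as an anomaly, which corresponds to the value sitting at least `z_limit` standard deviations from the rolling mean.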
Diagnostician Agent (Root Cause Analysis)
Identifies patterns such as:
- DB connection pool exhaustion
- Dependency timeouts
- Resource saturation
- App-layer regressions
- Misconfigurations
Predictive Agent (Forecasting)
- 15-minute risk projection
- Trend analysis
- Time-to-failure estimates
- Risk levels: low → critical
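A 15-minute projection with a time-to-failure estimate can be approximated by fitting a straight line to recent samples and extrapolating. The sketch below uses ordinary least squares in plain Python. The function name `forecast`, the intermediate risk levels ("medium", "high"), and the ratio cut-offs are assumptions; the source only names the endpoints low and critical.

```python
from typing import Optional, Sequence, Tuple


def forecast(samples: Sequence[Tuple[float, float]], limit: float,
             horizon_s: float = 900.0) -> dict:
    """Fit value = intercept + slope * t and project `horizon_s` seconds ahead.

    `samples` holds (timestamp_seconds, value) pairs; `limit` is the positive
    value at which the metric is considered failed.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t

    last_t = samples[-1][0]
    current = intercept + slope * last_t
    projected = intercept + slope * (last_t + horizon_s)

    # Time until the fitted line crosses the limit; None if the trend is not rising.
    time_to_limit: Optional[float] = None
    if slope > 0:
        time_to_limit = max((limit - current) / slope, 0.0)

    ratio = projected / limit  # assumed cut-offs below, for illustration only
    if ratio >= 1.0:
        risk = "critical"
    elif ratio >= 0.8:
        risk = "high"
    elif ratio >= 0.5:
        risk = "medium"
    else:
        risk = "low"
    return {"slope": slope, "projected": projected,
            "time_to_limit_s": time_to_limit, "risk": risk}
```

For example, latency rising by 10ms per minute from 100ms projects to about 280ms in 15 minutes, which crosses a 200ms limit after roughly 7 minutes.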
Quick Start
1. Clone
git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework
2. Create environment
python3.10 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
3. Install
pip install -r requirements.txt
4. Start
python app.py
Configuration
Create .env:
HF_TOKEN=your_token
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
LOG_LEVEL=INFO
HOST=0.0.0.0
PORT=7860
Note: HF_TOKEN is optional and used for downloading SentenceTransformer models from Hugging Face Hub.
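As an illustration of how these variables might be consumed, the sketch below reads them from a mapping (the process environment by default) into a frozen dataclass. `Settings` and `load_settings` are hypothetical names, and the defaults simply reuse the example values above. Note that a `.env` file is not loaded into the environment automatically; a loader such as python-dotenv is one option.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    data_dir: str
    index_file: str
    log_level: str
    host: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment-style key/value pairs."""
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # optional: empty string becomes None
        data_dir=env.get("DATA_DIR", "./data"),
        index_file=env.get("INDEX_FILE", "data/incident_vectors.index"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),  # raises ValueError on a non-numeric port
    )
```

Accepting a mapping instead of reading `os.environ` directly keeps the function easy to test with a plain dictionary.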
Custom Healing Policies
custom = HealingPolicy(
name="custom_latency",
conditions=[PolicyCondition("latency_p99", "gt", 200)],
actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
priority=1,
cool_down_seconds=300,
max_executions_per_hour=5,
)
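To show how a policy like this might be evaluated, the sketch below defines stand-in versions of `HealingPolicy`, `PolicyCondition`, and `HealingAction`. These are not the framework's real classes: the field names are inferred from the snippet above, and the `evaluate` method and the set of supported operators are assumptions.

```python
import operator
import time
from dataclasses import dataclass, field
from enum import Enum


class HealingAction(Enum):  # stand-in; the real enum may define more members
    RESTART_CONTAINER = "restart_container"
    ALERT_TEAM = "alert_team"


_OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt,
        "lte": operator.le, "eq": operator.eq}


@dataclass(frozen=True)
class PolicyCondition:
    metric: str
    op: str           # one of the keys in _OPS
    threshold: float

    def matches(self, metrics: dict) -> bool:
        value = metrics.get(self.metric)
        return value is not None and _OPS[self.op](value, self.threshold)


@dataclass
class HealingPolicy:
    name: str
    conditions: list
    actions: list
    priority: int = 1
    cool_down_seconds: float = 300
    max_executions_per_hour: int = 5
    _executions: list = field(default_factory=list, init=False, repr=False)

    def evaluate(self, metrics: dict, now=None) -> list:
        """Return the actions to run, or [] if conditions or limits block the policy."""
        now = time.monotonic() if now is None else now
        if not all(c.matches(metrics) for c in self.conditions):
            return []
        # Keep only executions from the last hour for the hourly cap.
        self._executions = [t for t in self._executions if now - t < 3600]
        if self._executions and now - self._executions[-1] < self.cool_down_seconds:
            return []  # still cooling down
        if len(self._executions) >= self.max_executions_per_hour:
            return []  # hourly cap reached
        self._executions.append(now)
        return list(self.actions)
```

With the `custom_latency` policy above, a p99 latency of 350ms would trigger both actions once, and further triggers would be suppressed for the next 300 seconds.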
Docker Deployment
A Dockerfile and docker-compose.yml are included.
docker-compose up -d
Performance Benchmarks
On Intel i7, 16GB RAM:
| Component | p50 | p99 |
|---|---|---|
| Total End-to-End | ~100ms | ~250ms |
| Policy Engine | 19ms | 38ms |
| Vector Encoding | 15ms | 30ms |
Stable memory: ~250MB
Throughput: 100+ events/sec
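For readers who want to reproduce p50/p99 figures on their own hardware, percentiles can be computed from timing samples with the standard library. This is a generic sketch, not the benchmark harness behind the numbers above; `measure` and `percentiles` are hypothetical helper names.

```python
import statistics
import time


def measure(func, runs: int = 200) -> list:
    """Call `func` repeatedly and return each call's duration in milliseconds."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return samples_ms


def percentiles(samples_ms: list) -> dict:
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


if __name__ == "__main__":
    print(percentiles(measure(lambda: sum(range(10_000)))))
```

Reporting p99 alongside p50 matters because tail latency often degrades well before the median does.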
Testing
Production Dependencies
pip install -r requirements.txt
Development Dependencies
pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy
Run Tests
pytest tests/ -v --cov
Coverage: 87%
Includes:
- Unit tests
- Thread-safety tests
- Stress tests
- Integration tests
Code Quality
# Format code
black .
# Lint code
ruff check .
# Type checking
mypy app.py
Roadmap
v2.1
- Distributed FAISS
- Prometheus / Grafana
- Slack & PagerDuty integration
- Custom alerting DSL
v3.0
- Reinforcement learning for policy optimization
- LSTM forecasting
- Dependency graph neural networks
Contributing
Pull requests welcome.
Please run tests before submitting.
Contact
Author: Juan Petter (LGCY Labs)
- petter2025us@outlook.com
- linkedin.com/in/petterjuan
- Book a session
⭐ Support
If this project helps you:
- Star the repo
- Share with your network
- Report issues
- Suggest features
Built with ❤️ for production reliability