| <p align="center"> |
| <img src="https://dummyimage.com/1200x260/000/fff&text=AGENTIC+RELIABILITY+FRAMEWORK" width="100%" alt="Agentic Reliability Framework Banner" /> |
| </p> |
|
|
| <h1 align="center">Agentic Reliability Framework</h1> |
|
|
| <p align="center"> |
| <strong>Adaptive anomaly detection + policy-driven self-healing for AI systems</strong><br> |
| Minimal, fast, and production-focused. |
| </p> |
|
|
| <p align="center"> |
| <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10+-blue" alt="Python 3.10+"></a> |
| <a href="#"><img src="https://img.shields.io/badge/status-MVP-green" alt="Status: MVP"></a> |
| <a href="#"><img src="https://img.shields.io/badge/license-MIT-lightgrey" alt="License: MIT"></a> |
| </p> |
|
|
| ## Agentic Reliability Framework |
|
|
| **Autonomous Reliability Engineering for Production AI Systems** |
|
|
| Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that detects, diagnoses, and predicts incidents, then triggers policy-driven healing actions automatically, with a median (p50) end-to-end event-processing latency of about 100ms. |
|
|
| ## ⭐ Key Features |
|
|
| - **Real-time anomaly detection** across latency, errors, throughput & resources |
| - **Root-cause analysis** with evidence correlation |
| - **Predictive forecasting** (15-minute lookahead) |
| - **Automated healing policies** (restart, rollback, scale, circuit break) |
| - **Incident memory** with FAISS for semantic recall |
| - **Security hardened** (known Gradio and Requests CVEs patched; see the Security Hardening table) |
| - **Thread-safe, async, process-pooled architecture** |
| - **~100ms end-to-end latency** (p50) |
|
|
| ## Security Hardening (v2.0) |
|
|
| | CVE | Severity | Component | Status | |
| |-----|----------|-----------|--------| |
| | CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched | |
| | CVE-2025-48889 | 7.5 | Gradio SVG DOS | ✅ Patched | |
| | CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched | |
| | CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched | |
| | CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched | |
|
|
| ### Additional Hardening |
|
|
| - SHA-256 hashing everywhere (no MD5) |
| - Pydantic v2 input validation |
| - Rate limiting (60 req/min/user) |
| - Atomic operations w/ thread-safe FAISS single-writer pattern |
| - Lock-free reads for high throughput |
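The 60 requests/minute/user limit listed above could be enforced with a sliding-window counter. The sketch below is only an illustration under that assumption: `SlidingWindowLimiter` and its parameters are hypothetical names, not the framework's actual rate limiter.

```python
import threading
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-user limiter: at most `limit` calls per `window` seconds."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # user id -> timestamps of recent calls
        self._lock = threading.Lock()

    def allow(self, user_id):
        """Return True if the call is within budget, False if it should be rejected."""
        now = self.clock()
        with self._lock:
            hits = self._hits[user_id]
            # Drop timestamps that have fallen out of the window.
            while hits and now - hits[0] >= self.window:
                hits.popleft()
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True
```

A request handler would call `limiter.allow(user_id)` first and reject the request when it returns `False`.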
|
|
| ## ⚡ Lock-Free Reads for High Throughput |
|
|
| By restructuring the internal memory stores around lock-free, single-writer / multi-reader semantics, the framework delivers deterministic concurrency without blocking. This removes tail-latency spikes and keeps event flows smooth even under burst load. |
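As a rough illustration of single-writer / multi-reader semantics, the sketch below funnels every write through a queue drained by one thread, while readers grab an immutable snapshot without taking a lock. `SnapshotStore` is a hypothetical name and this is not ARF's actual memory store; it relies on attribute assignment being a single reference swap in CPython.

```python
import queue
import threading


class SnapshotStore:
    """Hypothetical store: one writer thread, any number of lock-free readers."""

    def __init__(self):
        self._snapshot = ()            # immutable tuple, replaced wholesale
        self._pending = queue.Queue()  # all writes are funneled through this queue
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def submit(self, item):
        """Callable from any thread; never touches the snapshot directly."""
        self._pending.put(item)

    def read(self):
        """Lock-free read: returns whichever snapshot is current."""
        return self._snapshot

    def flush(self):
        """Block until every queued write has been applied."""
        self._pending.join()

    def _drain(self):
        while True:
            item = self._pending.get()
            # Build a new tuple, then publish it with a single reference swap.
            self._snapshot = self._snapshot + (item,)
            self._pending.task_done()
```

Copying the tuple on each write is the price of lock-free reads here; a real store would batch writes or use a persistent data structure to keep that cost bounded.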
|
|
| ### Performance Impact |
|
|
| | Metric | Before | After | Δ | |
| |--------|--------|-------|---| |
| | Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster | |
| | Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster | |
| | Agent Orchestration | Sequential | Parallel | 3× throughput | |
| | Memory Behavior | Growing | Stable / Bounded | 0 leaks | |
|
|
| ## 🧩 Architecture Overview |
|
|
| ### System Flow |
|
|
| ``` |
| Your Production System |
| (APIs, Databases, Microservices) |
| ↓ |
| Agentic Reliability Core |
| Detect → Diagnose → Predict |
| ↓ |
| Agents: |
| 🕵️ Detective Agent: Anomaly detection |
| Diagnostician Agent: Root cause analysis |
| 🔮 Predictive Agent: Forecasting / risk estimation |
| ↓ |
| Policy Engine (Auto-Healing) |
| ↓ |
| Healing Actions: |
| • Restart |
| • Scale |
| • Rollback |
| • Circuit-break |
| ``` |
|
|
| ## Core Framework Components |
|
|
| ### Web Framework & UI |
|
|
| - **Gradio 5.50+** - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860) |
| - **Python 3.10+** - Core implementation with asynchronous, thread-safe architecture |
|
|
| ### AI/ML Stack |
|
|
| - **FAISS-CPU 1.13.0** - Facebook AI Similarity Search for persistent incident memory and vector operations |
| - **SentenceTransformers 5.1.1** - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis |
| - **NumPy 1.26.4** - Numerical computing foundation for vector operations and data processing |
|
|
| ### Data & HTTP Layer |
|
|
| - **Pydantic 2.11+** - Type-safe data modeling with frozen models for immutability and runtime validation |
| - **Requests 2.32.5** - HTTP client library for external API communication (security patched) |
|
|
| ### Reliability & Resilience |
|
|
| - **CircuitBreaker 2.0+** - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention |
| - **AtomicWrites 1.4.1** - Atomic file operations ensuring data consistency and durability |
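To show what an atomic file operation buys you, here is a minimal standard-library sketch of the write-to-temp-then-rename technique. It does not use the AtomicWrites package listed above, and `atomic_write_bytes` is a hypothetical helper, not part of ARF.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write `data` to `path` so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX; overwrites the target on Windows too
    except BaseException:
        # Best-effort cleanup of the temporary file, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A crash mid-write leaves the previous file untouched, which matters for artifacts such as a persisted vector index.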
|
|
| ## 🎯 Architecture Pattern |
|
|
| ARF implements a **Multi-Agent Orchestration Pattern** with three specialized agents: |
|
|
| - **Detective Agent** - Anomaly detection |
| - **Diagnostician Agent** - Root cause analysis |
| - **Predictive Agent** - Future risk forecasting |
|
|
| All agents run in **parallel** (not sequential) for **3× throughput improvement**. |
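Running the three agents in parallel can be sketched with `asyncio.gather`. The agent functions below are placeholders with made-up return values, not ARF's real agents; they only show the orchestration shape.

```python
import asyncio


async def detective(event):
    await asyncio.sleep(0.01)  # stand-in for real analysis work
    return {"agent": "detective", "anomaly": event["latency_ms"] > 200}


async def diagnostician(event):
    await asyncio.sleep(0.01)
    return {"agent": "diagnostician", "root_cause": "unknown"}


async def predictive(event):
    await asyncio.sleep(0.01)
    return {"agent": "predictive", "risk": "low"}


async def analyze(event):
    # gather() runs the three coroutines concurrently and returns
    # their results in the order they were passed in.
    return await asyncio.gather(
        detective(event), diagnostician(event), predictive(event)
    )


results = asyncio.run(analyze({"latency_ms": 350}))
```

With sequential `await` calls the waits would add up; with `gather` the total wait is roughly that of the slowest agent. CPU-bound work would still need a thread or process pool.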
|
|
| ### ⚡ Performance Features |
|
|
| - Native async handlers (no event loop overhead) |
| - Thread-safe single-writer/multi-reader pattern for FAISS |
| - RLock-protected policy evaluation |
| - Queue-based writes to prevent race conditions |
| - ~100ms p50 latency at 100+ events/second |
|
|
| The framework combines **Gradio** for the web/UI layer, **FAISS** for vector memory, and **SentenceTransformers** for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability. |
|
|
| ## 🧪 The Three Agents |
|
|
| ### 🕵️ Detective Agent (Anomaly Detection) |
|
|
| Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade. |
|
|
| - Adaptive multi-metric scoring |
| - CPU/mem resource anomaly detection |
| - Latency & error spike detection |
| - Confidence scoring (0–1) |
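One common way to build an adaptive threshold with a 0–1 confidence score is a rolling z-score. The sketch below assumes that approach; `AdaptiveThreshold`, the window size, and the z-limit are illustrative choices, not the Detective Agent's actual algorithm.

```python
import math
from collections import deque


class AdaptiveThreshold:
    """Hypothetical detector: flags values far from a rolling baseline."""

    def __init__(self, window=50, z_limit=3.0):
        self.values = deque(maxlen=window)  # recent observations form the baseline
        self.z_limit = z_limit

    def score(self, value):
        """Return a confidence in [0, 1] that `value` is anomalous."""
        confidence = 0.0
        if len(self.values) >= 2:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var) or 1e-9  # avoid dividing by zero on a flat baseline
            z = abs(value - mean) / std
            # Map the z-score onto [0, 1]; z_limit or more saturates at 1.
            confidence = min(z / self.z_limit, 1.0)
        self.values.append(value)
        return confidence
```

Because the baseline follows recent traffic, the threshold adapts as normal behavior drifts. This simplified version also feeds anomalous values back into the baseline, which a production detector would likely avoid.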
|
|
| ### Diagnostician Agent (Root Cause Analysis) |
|
|
| Identifies patterns such as: |
|
|
| - DB connection pool exhaustion |
| - Dependency timeouts |
| - Resource saturation |
| - App-layer regressions |
| - Misconfigurations |
|
|
| ### 🔮 Predictive Agent (Forecasting) |
|
|
| - 15-minute risk projection |
| - Trend analysis |
| - Time-to-failure estimates |
| - Risk levels: low → critical |
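A time-to-failure estimate can be derived from a simple linear trend. The sketch below fits a least-squares slope and extrapolates from the latest sample to a threshold; the function names, the intermediate "medium" and "high" levels, and their cut-offs are assumptions for illustration, not the Predictive Agent's actual model.

```python
def minutes_to_threshold(samples, threshold):
    """Estimate minutes until a metric crosses `threshold`.

    `samples` is a list of (minute, value) pairs. Returns None when the
    fitted trend is flat or moving away from the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    # Least-squares slope: change in value per minute.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    _, last_v = samples[-1]
    remaining = threshold - last_v
    if slope == 0 or remaining / slope < 0:
        return None
    return remaining / slope


def risk_level(eta_minutes, horizon=15):
    """Map a time-to-failure estimate onto coarse risk levels (cut-offs are assumed)."""
    if eta_minutes is None or eta_minutes > horizon:
        return "low"
    if eta_minutes > horizon / 3:
        return "medium"
    if eta_minutes > 1:
        return "high"
    return "critical"
```

For example, a metric rising 10 units per minute that is 20 units below its limit gets an estimate of 2 minutes, which falls inside the 15-minute horizon.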
|
|
| ## Quick Start |
|
|
| ### 1. Clone |
|
|
| ```bash |
| git clone https://github.com/petterjuan/agentic-reliability-framework.git |
| cd agentic-reliability-framework |
| ``` |
|
|
| ### 2. Create environment |
|
|
| ```bash |
| python3.10 -m venv venv |
| source venv/bin/activate # Windows: venv\Scripts\activate |
| ``` |
|
|
| ### 3. Install |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### 4. Start |
|
|
| ```bash |
| python app.py |
| ``` |
|
|
| **UI:** http://localhost:7860 |
|
|
| ## Configuration |
|
|
| Create `.env`: |
|
|
| ```env |
| HF_TOKEN=your_token |
| DATA_DIR=./data |
| INDEX_FILE=data/incident_vectors.index |
| LOG_LEVEL=INFO |
| HOST=0.0.0.0 |
| PORT=7860 |
| ``` |
|
|
| **Note:** `HF_TOKEN` is optional and used for downloading SentenceTransformer models from Hugging Face Hub. |
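For reference, settings like these are typically read from the process environment at startup. The sketch below is an assumption about how that could look: `Settings` and `load_settings` are hypothetical names, and the defaults mirror the example values above rather than ARF's actual defaults. A `.env` file is not loaded by `os.environ` on its own, so this assumes the variables were exported or loaded by a separate tool.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    data_dir: str
    index_file: str
    log_level: str
    host: str
    port: int


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables (defaults are assumed)."""
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # optional, so an empty value maps to None
        data_dir=env.get("DATA_DIR", "./data"),
        index_file=env.get("INDEX_FILE", "data/incident_vectors.index"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),  # fails fast on a non-numeric port
    )
```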
|
|
| ## 🧩 Custom Healing Policies |
|
|
| ```python |
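| # HealingPolicy, PolicyCondition, and HealingAction come from the framework; |
| # import them from wherever your checkout defines them before running this. |
| # Fires when latency_p99 exceeds 200, at most 5 times per hour with a 300 s cool-down. |
| custom = HealingPolicy( |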
| name="custom_latency", |
| conditions=[PolicyCondition("latency_p99", "gt", 200)], |
| actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM], |
| priority=1, |
| cool_down_seconds=300, |
| max_executions_per_hour=5, |
| ) |
| ``` |
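To make `cool_down_seconds` and `max_executions_per_hour` concrete, here is a self-contained sketch of how a policy engine might apply them. `Condition` and `Policy` are simplified stand-ins written for this example, not ARF's `PolicyCondition` or `HealingPolicy`, and only the `gt` operator appears in the example above; `lt` is an assumed addition.

```python
import operator
import time
from dataclasses import dataclass, field

OPS = {"gt": operator.gt, "lt": operator.lt}  # "lt" is an assumed extra operator


@dataclass
class Condition:
    metric: str
    op: str
    threshold: float

    def matches(self, metrics):
        return self.metric in metrics and OPS[self.op](metrics[self.metric], self.threshold)


@dataclass
class Policy:
    name: str
    conditions: list
    actions: list
    cool_down_seconds: float = 300
    max_executions_per_hour: int = 5
    _runs: list = field(default_factory=list)  # timestamps of past executions

    def evaluate(self, metrics, now=None):
        """Return the actions to run, or [] if the policy should not fire."""
        now = time.time() if now is None else now
        if not all(c.matches(metrics) for c in self.conditions):
            return []
        # Keep only executions from the last hour.
        self._runs = [t for t in self._runs if now - t < 3600]
        if self._runs and now - self._runs[-1] < self.cool_down_seconds:
            return []  # still cooling down
        if len(self._runs) >= self.max_executions_per_hour:
            return []  # hourly budget exhausted
        self._runs.append(now)
        return list(self.actions)
```

The cool-down prevents a flapping metric from restarting a container repeatedly, and the hourly cap bounds the damage if a policy keeps matching.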
|
|
| ## 🐳 Docker Deployment |
|
|
| Dockerfile and docker-compose.yml included. |
|
|
| ```bash |
| docker-compose up -d |
| ``` |
|
|
| ## Performance Benchmarks |
|
|
| **On Intel i7, 16GB RAM:** |
|
|
| | Component | p50 | p99 | |
| |-----------|-----|-----| |
| | Total End-to-End | ~100ms | ~250ms | |
| | Policy Engine | 19ms | 38ms | |
| | Vector Encoding | 15ms | 30ms | |
|
|
| **Stable memory:** ~250MB |
| **Throughput:** 100+ events/sec |
|
|
| ## 🧪 Testing |
|
|
| ### Production Dependencies |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Development Dependencies |
|
|
| ```bash |
| pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy |
| ``` |
|
|
| ### Run Tests |
|
|
| ```bash |
| pytest tests/ -v --cov |
| ``` |
|
|
| **Coverage:** 87% |
|
|
| Includes: |
| - Unit tests |
| - Thread-safety tests |
| - Stress tests |
| - Integration tests |
|
|
| ### Code Quality |
|
|
| ```bash |
| # Format code |
| black . |
| |
| # Lint code |
| ruff check . |
| |
| # Type checking |
| mypy app.py |
| ``` |
|
|
| ## 🗺 Roadmap |
|
|
| ### v2.1 |
|
|
| - Distributed FAISS |
| - Prometheus / Grafana |
| - Slack & PagerDuty integration |
| - Custom alerting DSL |
|
|
| ### v3.0 |
|
|
| - Reinforcement learning for policy optimization |
| - LSTM forecasting |
| - Dependency graph neural networks |
|
|
| ## 🤝 Contributing |
|
|
| Pull requests welcome. |
|
|
| Please run tests before submitting. |
|
|
| ## 📬 Contact |
|
|
| **Author:** Juan Petter (LGCY Labs) |
|
|
| - 📧 [petter2025us@outlook.com](mailto:petter2025us@outlook.com) |
| - [linkedin.com/in/petterjuan](https://linkedin.com/in/petterjuan) |
| - 📅 [Book a session](https://calendly.com/petter2025us/30min) |
|
|
| ## ⭐ Support |
|
|
| If this project helps you: |
|
|
| - ⭐ Star the repo |
| - Share with your network |
| - Report issues |
| - 💡 Suggest features |
|
|
| <p align="center"> |
| <sub>Built with ❤️ for production reliability</sub> |
| </p> |