
โš™๏ธ Agentic Reliability Framework

Adaptive anomaly detection + policy-driven self-healing for AI systems
Minimal, fast, and production-focused.

Python 3.10+ | Status: MVP | License: MIT

## 🧠 Agentic Reliability Framework

**Autonomous Reliability Engineering for Production AI Systems**

Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that detects, diagnoses, predicts, and resolves incidents automatically in under 100ms.

## ⭐ Key Features

- **Real-time anomaly detection** across latency, errors, throughput & resources
- **Root-cause analysis** with evidence correlation
- **Predictive forecasting** (15-minute lookahead)
- **Automated healing policies** (restart, rollback, scale, circuit break)
- **Incident memory** with FAISS for semantic recall
- **Security hardened** (all CVEs listed under Security Hardening are patched)
- **Thread-safe, async, process-pooled architecture**
- **Sub-100ms end-to-end latency** (p50)

## 🔐 Security Hardening (v2.0)

| CVE | Severity | Component | Status |
|-----|----------|-----------|--------|
| CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
| CVE-2025-48889 | 7.5 | Gradio SVG DoS | ✅ Patched |
| CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
| CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
| CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |

### Additional Hardening

- SHA-256 hashing everywhere (no MD5)
- Pydantic v2 input validation
- Rate limiting (60 req/min/user)
- Atomic operations with a thread-safe FAISS single-writer pattern
- Lock-free reads for high throughput

## ⚡ Lock-Free Reads for High Throughput

The internal memory stores are built around lock-free, single-writer / multi-reader semantics, so reads never block on writes. This removes tail-latency spikes and keeps event flow smooth even under burst load.
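To make the single-writer / multi-reader idea concrete, here is a minimal sketch of one common way to get lock-free reads in CPython: the writer publishes immutable snapshots and readers simply pick up the current reference. This is an illustration under stated assumptions, not ARF's actual implementation; `SnapshotStore` and its methods are hypothetical names, and the real stores wrap a FAISS index rather than a tuple of strings.

```python
import queue
import threading


class SnapshotStore:
    """Hypothetical single-writer / multi-reader store (copy-on-write)."""

    def __init__(self) -> None:
        # Immutable snapshot: it is replaced as a whole, never mutated in place.
        self._snapshot: tuple[str, ...] = ()
        self._pending: queue.Queue = queue.Queue()
        # The only thread that ever rebinds _snapshot.
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def read(self) -> tuple[str, ...]:
        # No lock taken: rebinding an attribute is atomic in CPython, so a
        # reader sees either the old snapshot or the new one, never a mix.
        return self._snapshot

    def submit(self, item: str) -> None:
        # Any thread may enqueue a write; the queue serializes them.
        self._pending.put(item)

    def flush(self) -> None:
        # Block until every item submitted so far has been published.
        self._pending.join()

    def _drain(self) -> None:
        while True:
            item = self._pending.get()
            # Build a new tuple, then publish it with a single assignment.
            self._snapshot = self._snapshot + (item,)
            self._pending.task_done()


if __name__ == "__main__":
    store = SnapshotStore()
    for i in range(3):
        store.submit(f"incident-{i}")
    store.flush()
    print(store.read())
```

Copying the whole snapshot on every write costs O(n), so this shape suits read-heavy workloads; a store with frequent writes would more likely batch queued items before publishing.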
### Performance Impact

| Metric | Before | After | Δ |
|--------|--------|-------|---|
| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
| Agent Orchestration | Sequential | Parallel | 3× throughput |
| Memory Behavior | Growing | Stable / Bounded | 0 leaks |

## 🧩 Architecture Overview

### System Flow

```
Your Production System
(APIs, Databases, Microservices)
        ↓
Agentic Reliability Core
Detect → Diagnose → Predict
        ↓
Agents:
  🕵️ Detective Agent – Anomaly detection
  🔍 Diagnostician Agent – Root cause analysis
  🔮 Predictive Agent – Forecasting / risk estimation
        ↓
Policy Engine (Auto-Healing)
        ↓
Healing Actions:
  • Restart
  • Scale
  • Rollback
  • Circuit-break
```

## 🏗️ Core Framework Components

### Web Framework & UI

- **Gradio 5.50+** - High-performance async web framework serving both the API layer and the interactive observability dashboard (localhost:7860)
- **Python 3.10+** - Core implementation with asynchronous, thread-safe architecture

### AI/ML Stack

- **FAISS-CPU 1.13.0** - Facebook AI Similarity Search for persistent incident memory and vector operations
- **SentenceTransformers 5.1.1** - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
- **NumPy 1.26.4** - Numerical computing foundation for vector operations and data processing

### Data & HTTP Layer

- **Pydantic 2.11+** - Type-safe data modeling with frozen models for immutability and runtime validation
- **Requests 2.32.5** - HTTP client library for external API communication (security patched)

### Reliability & Resilience

- **CircuitBreaker 2.0+** - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
- **AtomicWrites 1.4.1** - Atomic file operations ensuring data consistency and durability

## 🎯 Architecture Pattern

ARF implements a **Multi-Agent Orchestration Pattern** with three specialized agents:

- **Detective Agent** - Anomaly detection
- **Diagnostician Agent** - Root cause analysis
- **Predictive Agent** - Future risk forecasting

All agents run in **parallel** (not sequentially) for a **3× throughput improvement**.

### ⚡ Performance Features

- Native async handlers (no event loop overhead)
- Thread-safe single-writer/multi-reader pattern for FAISS
- RLock-protected policy evaluation
- Queue-based writes to prevent race conditions
- Sub-100ms p50 latency at 100+ events/second

The framework combines **Gradio** for the web/UI layer, **FAISS** for vector memory, and **SentenceTransformers** for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.

## 🧪 The Three Agents

### 🕵️ Detective Agent (Anomaly Detection)

Uses real-time vector embeddings and adaptive thresholds to surface deviations before they cascade.

- Adaptive multi-metric scoring
- CPU/memory resource anomaly detection
- Latency & error spike detection
- Confidence scoring (0–1)

### 🔍 Diagnostician Agent (Root Cause Analysis)

Identifies patterns such as:

- DB connection pool exhaustion
- Dependency timeouts
- Resource saturation
- App-layer regressions
- Misconfigurations

### 🔮 Predictive Agent (Forecasting)

- 15-minute risk projection
- Trend analysis
- Time-to-failure estimates
- Risk levels: low → critical

## 🚀 Quick Start

### 1. Clone

```bash
git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework
```

### 2. Create environment

```bash
python3.10 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

### 3. Install

```bash
pip install -r requirements.txt
```

### 4. Start

```bash
python app.py
```

**UI:** http://localhost:7860

## 🛠 Configuration

Create `.env`:

```env
HF_TOKEN=your_token
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
LOG_LEVEL=INFO
HOST=0.0.0.0
PORT=7860
```

**Note:** `HF_TOKEN` is optional and used for downloading SentenceTransformer models from Hugging Face Hub.

## 🧩 Custom Healing Policies

```python
# HealingPolicy, PolicyCondition, and HealingAction are ARF types;
# import them from the module that defines them in your checkout.
custom = HealingPolicy(
    name="custom_latency",
    # Trigger when the latency_p99 metric is greater than 200.
    conditions=[PolicyCondition("latency_p99", "gt", 200)],
    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
    priority=1,
    cool_down_seconds=300,
    max_executions_per_hour=5,
)
```

## 🐳 Docker Deployment

A `Dockerfile` and `docker-compose.yml` are included.

```bash
docker-compose up -d
```

## 📈 Performance Benchmarks

**On Intel i7, 16GB RAM:**

| Component | p50 | p99 |
|-----------|-----|-----|
| Total End-to-End | ~100ms | ~250ms |
| Policy Engine | 19ms | 38ms |
| Vector Encoding | 15ms | 30ms |

**Stable memory:** ~250MB

**Throughput:** 100+ events/sec

## 🧪 Testing

### Production Dependencies

```bash
pip install -r requirements.txt
```

### Development Dependencies

```bash
pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy
```

### Run Tests

```bash
pytest tests/ -v --cov
```

**Coverage:** 87%

Includes:

- Unit tests
- Thread-safety tests
- Stress tests
- Integration tests

### Code Quality

```bash
# Format code
black .

# Lint code
ruff check .

# Type checking
mypy app.py
```

## 🗺 Roadmap

### v2.1

- Distributed FAISS
- Prometheus / Grafana
- Slack & PagerDuty integration
- Custom alerting DSL

### v3.0

- Reinforcement learning for policy optimization
- LSTM forecasting
- Dependency graph neural networks

## 🤝 Contributing

Pull requests welcome. Please run tests before submitting.
## 📬 Contact

**Author:** Juan Petter (LGCY Labs)

- 📧 [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
- 🔗 [linkedin.com/in/petterjuan](https://linkedin.com/in/petterjuan)
- 📅 [Book a session](https://calendly.com/petter2025us/30min)

## ⭐ Support

If this project helps you:

- ⭐ Star the repo
- 🔄 Share with your network
- 🐛 Report issues
- 💡 Suggest features

Built with ❤️ for production reliability