	<p align="center">
	<img src="https://dummyimage.com/1200x260/000/fff&text=AGENTIC+RELIABILITY+FRAMEWORK" width="100%" alt="Agentic Reliability Framework Banner" />
	</p>

	<h1 align="center">⚙️ Agentic Reliability Framework</h1>

	<p align="center">
	<strong>Adaptive anomaly detection + policy-driven self-healing for AI systems</strong><br>
	Minimal, fast, and production-focused.
	</p>

	<p align="center">
	<a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10+-blue" alt="Python 3.10+"></a>
	<a href="#"><img src="https://img.shields.io/badge/status-MVP-green" alt="Status: MVP"></a>
	<a href="#"><img src="https://img.shields.io/badge/license-MIT-lightgrey" alt="License: MIT"></a>
	</p>

	## 🧠 Agentic Reliability Framework

	Autonomous Reliability Engineering for Production AI Systems

	Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that detects, diagnoses, predicts, and resolves incidents automatically, with a median (p50) end-to-end processing latency of roughly 100ms.

	## ⭐ Key Features

	- Real-time anomaly detection across latency, errors, throughput & resources
	- Root-cause analysis with evidence correlation
	- Predictive forecasting (15-minute lookahead)
	- Automated healing policies (restart, rollback, scale, circuit break)
	- Incident memory with FAISS for semantic recall
	- Security hardened (the CVEs listed under Security Hardening are patched)
	- Thread-safe, async, process-pooled architecture
	- Roughly 100ms end-to-end latency (p50)

	## 🔐 Security Hardening (v2.0)

	| CVE | Severity | Component | Status |
	|-----|----------|-----------|--------|
	| CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
	| CVE-2025-48889 | 7.5 | Gradio SVG DoS | ✅ Patched |
	| CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
	| CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
	| CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |

	### Additional Hardening

	- SHA-256 hashing everywhere (no MD5)
	- Pydantic v2 input validation
	- Rate limiting (60 req/min/user)
	- Atomic operations with a thread-safe, single-writer pattern for FAISS
	- Lock-free reads for high throughput

	## ⚡ Lock-Free Reads for High Throughput

	The internal memory stores follow a lock-free, single-writer / multi-reader design: one writer applies all updates while readers proceed without acquiring locks. This removes lock contention as a source of tail-latency spikes and keeps event flow smooth even under burst load.
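
One way to picture the single-writer / multi-reader idea is a store that publishes immutable snapshots. The sketch below is an illustration under assumptions, not ARF's implementation: `SnapshotStore` is a hypothetical name, a tuple stands in for the FAISS index, and it relies on reference assignment being atomic in CPython.

```python
import queue
import threading


class SnapshotStore:
    """Hypothetical single-writer / multi-reader store built on immutable snapshots."""

    def __init__(self):
        self._snapshot = ()  # readers only ever see a complete, immutable tuple
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def submit(self, item) -> None:
        """Callable from any thread: enqueue a write for the single writer."""
        self._pending.put(item)

    def read(self) -> tuple:
        """Read without taking a lock: return the currently published snapshot."""
        return self._snapshot

    def flush(self) -> None:
        """Block until every queued write has been applied."""
        self._pending.join()

    def _drain(self) -> None:
        while True:
            item = self._pending.get()
            # Build the new snapshot first, then publish it with one assignment.
            self._snapshot = self._snapshot + (item,)
            self._pending.task_done()


store = SnapshotStore()
for incident_id in range(5):
    store.submit(incident_id)
store.flush()
snapshot = store.read()
```

Copying the tuple on every write costs O(n), so this trades write cost for cheap, non-blocking reads; a production store would batch writes or use a structure with cheaper updates.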

	### Performance Impact

	| Metric | Before | After | Δ |
	|--------|--------|-------|---|
	| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
	| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
	| Agent Orchestration | Sequential | Parallel | 3× throughput |
	| Memory Behavior | Growing | Stable / Bounded | 0 leaks |
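
The percentages in the table read as reductions in latency relative to the "Before" column. A quick check of that arithmetic, using the approximate figures from the table (the helper name `percent_reduction` is only for illustration):

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, as a percentage of the original value."""
    return (before_ms - after_ms) / before_ms * 100.0


p50_gain = percent_reduction(350, 100)  # table row: Event Processing (p50)
p99_gain = percent_reduction(800, 250)  # table row: Event Processing (p99)
print(f"p50 reduced by {p50_gain:.1f}%, p99 reduced by {p99_gain:.1f}%")
```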

	## 🧩 Architecture Overview

	### System Flow

	```
	Your Production System
	(APIs, Databases, Microservices)
	↓
	Agentic Reliability Core
	Detect → Diagnose → Predict
	↓
	Agents:
	🕵️ Detective Agent – Anomaly detection
	🔍 Diagnostician Agent – Root cause analysis
	🔮 Predictive Agent – Forecasting / risk estimation
	↓
	Policy Engine (Auto-Healing)
	↓
	Healing Actions:
	• Restart
	• Scale
	• Rollback
	• Circuit-break
	```

	## 🏗️ Core Framework Components

	### Web Framework & UI

	- Gradio 5.50+ - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860)
	- Python 3.10+ - Core implementation with asynchronous, thread-safe architecture

	### AI/ML Stack

	- FAISS-CPU 1.13.0 - Facebook AI Similarity Search for persistent incident memory and vector operations
	- SentenceTransformers 5.1.1 - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
	- NumPy 1.26.4 - Numerical computing foundation for vector operations and data processing
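
To show what "semantic recall" over incident memory means, here is a dependency-free sketch. It deliberately does not use FAISS or SentenceTransformers: the hand-written three-dimensional vectors stand in for real embeddings, brute-force cosine similarity stands in for an index search, and `IncidentMemory` is a hypothetical name.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IncidentMemory:
    """Toy stand-in for a vector index: stores (vector, description) pairs."""

    def __init__(self):
        self._items = []

    def add(self, vector, description: str) -> None:
        self._items.append((list(vector), description))

    def recall(self, query, k: int = 1) -> list:
        """Return the descriptions of the k most similar stored incidents."""
        ranked = sorted(self._items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [description for _, description in ranked[:k]]


memory = IncidentMemory()
memory.add([0.9, 0.1, 0.0], "DB connection pool exhaustion")
memory.add([0.1, 0.9, 0.0], "Dependency timeout")
memory.add([0.0, 0.1, 0.9], "Resource saturation")

# A new incident whose (made-up) embedding is closest to the first stored one.
nearest = memory.recall([0.8, 0.2, 0.1], k=1)
```

In the real stack the vectors would come from an embedding model and the search from a FAISS index, which scales far better than this linear scan.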

	### Data & HTTP Layer

	- Pydantic 2.11+ - Type-safe data modeling with frozen models for immutability and runtime validation
	- Requests 2.32.5 - HTTP client library for external API communication (security patched)

	### Reliability & Resilience

	- CircuitBreaker 2.0+ - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
	- AtomicWrites 1.4.1 - Atomic file operations ensuring data consistency and durability
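
The circuit breaker pattern mentioned above can be sketched in a few lines. This is a hand-rolled illustration, not the API of the CircuitBreaker package: `SimpleCircuitBreaker`, `CircuitOpenError`, and the parameter names are assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


fake_now = [0.0]  # injectable clock keeps the demo deterministic
breaker = SimpleCircuitBreaker(failure_threshold=2, recovery_timeout=10.0,
                               clock=lambda: fake_now[0])


def flaky():
    raise ConnectionError("downstream unavailable")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

fake_now[0] = 11.0  # past the recovery timeout
recovered = breaker.call(lambda: "ok")
```

The third call never reaches the failing dependency, which is the point of the pattern: a struggling downstream service gets time to recover instead of more load.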

	## 🎯 Architecture Pattern

	ARF implements a Multi-Agent Orchestration Pattern with three specialized agents:

	- Detective Agent - Anomaly detection
	- Diagnostician Agent - Root cause analysis
	- Predictive Agent - Future risk forecasting

	All three agents run in parallel rather than sequentially, which accounts for the roughly 3× throughput improvement.
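
As a rough illustration of parallel orchestration, the sketch below fans one event out to three coroutines with `asyncio.gather`. The agent functions, their return shapes, and the `analyze` helper are placeholders invented for the example; ARF's real agent interfaces are not shown in this README.

```python
import asyncio


async def detective(event: dict) -> dict:
    await asyncio.sleep(0.01)  # stand-in for real analysis work
    return {"agent": "detective", "anomaly": event["latency_p99"] > 200}


async def diagnostician(event: dict) -> dict:
    await asyncio.sleep(0.01)
    return {"agent": "diagnostician", "root_cause": "unknown"}


async def predictive(event: dict) -> dict:
    await asyncio.sleep(0.01)
    return {"agent": "predictive", "risk": "low"}


async def analyze(event: dict) -> list:
    # All three agents start together; results come back in argument order.
    return await asyncio.gather(
        detective(event),
        diagnostician(event),
        predictive(event),
    )


findings = asyncio.run(analyze({"latency_p99": 350}))
```

Coroutines overlap only while they wait, so this helps I/O-bound agents; CPU-bound work would need a thread or process pool to run truly in parallel.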

	### ⚡ Performance Features

	- Native async handlers (no event loop overhead)
	- Thread-safe single-writer/multi-reader pattern for FAISS
	- RLock-protected policy evaluation
	- Queue-based writes to prevent race conditions
	- Roughly 100ms p50 latency at 100+ events/second

	The framework combines Gradio for the web/UI layer, FAISS for vector memory, and SentenceTransformers for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.

	## 🧪 The Three Agents

	### 🕵️ Detective Agent (Anomaly Detection)

	Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade.

	- Adaptive multi-metric scoring
	- CPU/mem resource anomaly detection
	- Latency & error spike detection
	- Confidence scoring (0–1)
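
An adaptive threshold with a 0 to 1 confidence score could look like the sketch below. It is one possible technique (an exponentially weighted mean and variance with a z-score), chosen for illustration; the README does not say which method the Detective Agent uses, and `AdaptiveThreshold` and its parameters are hypothetical.

```python
import math


class AdaptiveThreshold:
    """Hypothetical detector: exponentially weighted baseline plus z-score confidence."""

    def __init__(self, alpha: float = 0.1, z_limit: float = 3.0, warmup: int = 5):
        self.alpha = alpha      # how quickly the baseline adapts
        self.z_limit = z_limit  # z-score that maps to a confidence of 0.5
        self.warmup = warmup    # samples to observe before scoring
        self.count = 0
        self.mean = None
        self.var = 0.0

    def score(self, value: float) -> float:
        """Return a confidence in [0, 1] that `value` is anomalous, then adapt."""
        self.count += 1
        if self.mean is None:
            self.mean = value
            return 0.0
        std = math.sqrt(self.var)
        z = abs(value - self.mean) / std if std > 0 else 0.0
        # Update the exponentially weighted mean and variance.
        delta = value - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        if self.count <= self.warmup:
            return 0.0  # baseline not trusted yet
        return min(z / (2 * self.z_limit), 1.0)


detector = AdaptiveThreshold()
baseline_scores = [detector.score(v) for v in [100, 102, 98, 101, 99, 100, 101, 99]]
spike_score = detector.score(400)  # a latency spike far outside the baseline
```

Because the baseline keeps adapting, a sustained shift eventually stops being flagged; whether that is desirable depends on the metric.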

	### 🔍 Diagnostician Agent (Root Cause Analysis)

	Identifies patterns such as:

	- DB connection pool exhaustion
	- Dependency timeouts
	- Resource saturation
	- App-layer regressions
	- Misconfigurations

	### 🔮 Predictive Agent (Forecasting)

	- 15-minute risk projection
	- Trend analysis
	- Time-to-failure estimates
	- Risk levels: low → critical
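
A 15-minute projection and time-to-failure estimate can be illustrated with a least-squares trend line. This is a simplified sketch, not the Predictive Agent's actual model: the function names, the 90% limit, and the sample data are assumptions, and only the two end-point risk levels named above are used.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for a list of (minute, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("need samples at two or more distinct times")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    return slope, mean_v - slope * mean_t


def forecast(samples, horizon_minutes: float = 15.0, limit: float = 90.0) -> dict:
    """Project the trend forward and estimate minutes until `limit` is reached."""
    slope, intercept = linear_trend(samples)
    last_t = samples[-1][0]
    current = intercept + slope * last_t
    projected = intercept + slope * (last_t + horizon_minutes)
    time_to_failure = (limit - current) / slope if slope > 0 and current < limit else None
    return {
        "projected": projected,
        "time_to_failure_minutes": time_to_failure,
        # Intermediate levels between "low" and "critical" are omitted here.
        "risk": "critical" if projected >= limit else "low",
    }


# Made-up memory utilisation (%) sampled once per minute.
usage = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0), (4, 68.0)]
result = forecast(usage)
```

With a steady rise of 2 points per minute from 68%, the 90% limit is reached well inside the 15-minute horizon, so the projection is flagged as critical.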

	## 🚀 Quick Start

	### 1. Clone

	```bash
	git clone https://github.com/petterjuan/agentic-reliability-framework.git
	cd agentic-reliability-framework
	```

	### 2. Create environment

	```bash
	python3.10 -m venv venv
	source venv/bin/activate # Windows: venv\Scripts\activate
	```

	### 3. Install

	```bash
	pip install -r requirements.txt
	```

	### 4. Start

	```bash
	python app.py
	```

	UI: http://localhost:7860

	## 🛠 Configuration

	Create `.env`:

	```env
	HF_TOKEN=your_token
	DATA_DIR=./data
	INDEX_FILE=data/incident_vectors.index
	LOG_LEVEL=INFO
	HOST=0.0.0.0
	PORT=7860
	```

	Note: `HF_TOKEN` is optional and used for downloading SentenceTransformer models from Hugging Face Hub.

	## 🧩 Custom Healing Policies

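	```python
	# Assumes HealingPolicy, PolicyCondition, and HealingAction are already imported
	# from the ARF codebase; this README does not state the module path.
	custom = HealingPolicy(
	    name="custom_latency",
	    # Trigger when the latency_p99 metric is greater than 200
	    conditions=[PolicyCondition("latency_p99", "gt", 200)],
	    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
	    priority=1,
	    cool_down_seconds=300,
	    max_executions_per_hour=5,
	)
	```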
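
To make the policy fields concrete, here is a self-contained sketch of how such a policy might be evaluated. The classes below are stand-ins written for this example so it runs on its own; they reuse the names from the snippet above, but the real ARF classes, the set of supported operators, and the exact cool-down semantics may differ.

```python
import operator as op
from dataclasses import dataclass, field
from enum import Enum

# Only "gt" appears in this README; "lt" is an assumed extra for illustration.
OPERATORS = {"gt": op.gt, "lt": op.lt}


class HealingAction(Enum):
    RESTART_CONTAINER = "restart_container"
    ALERT_TEAM = "alert_team"


@dataclass(frozen=True)
class PolicyCondition:
    metric: str
    operator: str
    threshold: float


@dataclass
class HealingPolicy:
    name: str
    conditions: list
    actions: list
    priority: int = 1
    cool_down_seconds: int = 300
    max_executions_per_hour: int = 5
    _executions: list = field(default_factory=list, repr=False)  # past run times

    def evaluate(self, metrics: dict, now: float) -> list:
        """Return the actions to run, or [] if conditions or limits block them."""
        matched = all(
            c.metric in metrics and OPERATORS[c.operator](metrics[c.metric], c.threshold)
            for c in self.conditions
        )
        if not matched:
            return []
        recent = [t for t in self._executions if now - t < 3600]
        if recent and now - recent[-1] < self.cool_down_seconds:
            return []  # still cooling down after the last run
        if len(recent) >= self.max_executions_per_hour:
            return []  # hourly budget exhausted
        self._executions = recent + [now]
        return list(self.actions)


policy = HealingPolicy(
    name="custom_latency",
    conditions=[PolicyCondition("latency_p99", "gt", 200)],
    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
)
first = policy.evaluate({"latency_p99": 350}, now=0.0)      # fires
second = policy.evaluate({"latency_p99": 350}, now=60.0)    # blocked by cool-down
healthy = policy.evaluate({"latency_p99": 150}, now=600.0)  # condition not met
later = policy.evaluate({"latency_p99": 350}, now=600.0)    # fires again
```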

	## 🐳 Docker Deployment

	Dockerfile and docker-compose.yml included.

	```bash
	docker-compose up -d
	```

	## 📈 Performance Benchmarks

	Measured on an Intel i7 with 16GB RAM:

	| Component | p50 | p99 |
	|-----------|-----|-----|
	| Total End-to-End | ~100ms | ~250ms |
	| Policy Engine | 19ms | 38ms |
	| Vector Encoding | 15ms | 30ms |

	- Stable memory: ~250MB
	- Throughput: 100+ events/sec
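
If you want to reproduce p50 and p99 figures on your own hardware, percentiles can be computed from recorded latencies with the standard library. The sketch below is one way to do it, not necessarily how the numbers above were produced; `latency_percentiles` is a hypothetical helper and the sample data is synthetic.

```python
import statistics


def latency_percentiles(samples_ms) -> dict:
    """Compute p50 and p99 from a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 98 the 99th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


samples = list(range(1, 101))  # synthetic latencies: 1ms through 100ms
summary = latency_percentiles(samples)
```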

	## 🧪 Testing

	### Production Dependencies

	```bash
	pip install -r requirements.txt
	```

	### Development Dependencies

	```bash
	pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy
	```

	### Run Tests

	```bash
	pytest tests/ -v --cov
	```

	Coverage: 87%

	Includes:
	- Unit tests
	- Thread-safety tests
	- Stress tests
	- Integration tests

	### Code Quality

	```bash
	# Format code
	black .

	# Lint code
	ruff check .

	# Type checking
	mypy app.py
	```

	## 🗺 Roadmap

	### v2.1

	- Distributed FAISS
	- Prometheus / Grafana
	- Slack & PagerDuty integration
	- Custom alerting DSL

	### v3.0

	- Reinforcement learning for policy optimization
	- LSTM forecasting
	- Dependency graph neural networks

	## 🤝 Contributing

	Pull requests welcome.

	Please run tests before submitting.

	## 📬 Contact

	Author: Juan Petter (LGCY Labs)

	- 📧 [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
	- 🔗 [linkedin.com/in/petterjuan](https://linkedin.com/in/petterjuan)
	- 📅 [Book a session](https://calendly.com/petter2025us/30min)

	## ⭐ Support

	If this project helps you:

	- ⭐ Star the repo
	- 🔄 Share with your network
	- 🐛 Report issues
	- 💡 Suggest features

	<p align="center">
	<sub>Built with ❤️ for production reliability</sub>
	</p>