

⚙️ Agentic Reliability Framework

Adaptive anomaly detection + policy-driven self-healing for AI systems
Minimal, fast, and production-focused.

Python 3.10+ | Status: MVP | License: MIT

🧠 Agentic Reliability Framework

Autonomous Reliability Engineering for Production AI Systems

Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-focused, multi-agent system that detects, diagnoses, and predicts incidents, then applies healing policies automatically, with a median (p50) end-to-end latency of about 100 ms.

⭐ Key Features

  • Real-time anomaly detection across latency, errors, throughput & resources
  • Root-cause analysis with evidence correlation
  • Predictive forecasting (15-minute lookahead)
  • Automated healing policies (restart, rollback, scale, circuit break)
  • Incident memory with FAISS for semantic recall
  • Security hardened (known Gradio and Requests CVEs patched)
  • Thread-safe, async, process-pooled architecture
  • ~100 ms end-to-end latency (p50)

🔐 Security Hardening (v2.0)

| CVE | Severity | Component | Status |
|---|---|---|---|
| CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
| CVE-2025-48889 | 7.5 | Gradio SVG DoS | ✅ Patched |
| CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
| CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
| CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |

Additional Hardening

  • SHA-256 hashing everywhere (no MD5)
  • Pydantic v2 input validation
  • Rate limiting (60 req/min/user)
  • Atomic operations w/ thread-safe FAISS single-writer pattern
  • Lock-free reads for high throughput
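As an illustration of the per-user rate limit listed above, here is a minimal sliding-window sketch. The class name `SlidingWindowRateLimiter` and its parameters are hypothetical and are not taken from the ARF codebase; the sketch only shows one common way to enforce 60 requests per minute per user.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window_seconds` for each user (illustrative)."""

    def __init__(self, limit: int = 60, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over budget: reject without recording the request
        hits.append(now)
        return True
```

This version is not thread-safe on its own; in a multi-threaded server the `allow` call would need to be guarded by a lock or confined to a single thread.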

⚡ Lock-Free Reads for High Throughput

By restructuring the internal memory stores around lock-free, single-writer / multi-reader semantics, the framework delivers deterministic concurrency without blocking. This removes tail-latency spikes and keeps event flows smooth even under burst load.
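One way to get single-writer / multi-reader behavior in Python is copy-on-write snapshots: readers grab a reference to an immutable snapshot without taking a lock, and the writer publishes a new snapshot by rebinding that reference. The sketch below is an assumption about how such a store could look, not ARF's actual implementation; `SnapshotStore` is a hypothetical name.

```python
import threading


class SnapshotStore:
    """Single-writer / multi-reader store using copy-on-write snapshots (illustrative)."""

    def __init__(self):
        self._snapshot: tuple = ()           # immutable, so readers never see partial updates
        self._write_lock = threading.Lock()  # serializes writers only

    def read(self) -> tuple:
        # No lock: rebinding a reference is atomic in CPython, so a reader
        # gets either the old snapshot or the new one, never a mix.
        return self._snapshot

    def append(self, item) -> None:
        with self._write_lock:
            # Build the new snapshot first, then publish it in one assignment.
            self._snapshot = self._snapshot + (item,)
```

The trade-off is that every write copies the snapshot, so this pattern suits read-heavy workloads with comparatively rare writes.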

Performance Impact

| Metric | Before | After | Δ |
|---|---|---|---|
| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
| Agent Orchestration | Sequential | Parallel | 3× throughput |
| Memory Behavior | Growing | Stable / Bounded | 0 leaks |

🧩 Architecture Overview

System Flow

Your Production System
(APIs, Databases, Microservices)
           ↓
  Agentic Reliability Core
  Detect → Diagnose → Predict
           ↓
        Agents:
  🕵️ Detective Agent – Anomaly detection
  🔍 Diagnostician Agent – Root cause analysis
  🔮 Predictive Agent – Forecasting / risk estimation
           ↓
    Policy Engine (Auto-Healing)
           ↓
    Healing Actions:
    • Restart
    • Scale
    • Rollback
    • Circuit-break

🏗️ Core Framework Components

Web Framework & UI

  • Gradio 5.50+ - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860)
  • Python 3.10+ - Core implementation with asynchronous, thread-safe architecture

AI/ML Stack

  • FAISS-CPU 1.13.0 - Facebook AI Similarity Search for persistent incident memory and vector operations
  • SentenceTransformers 5.1.1 - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
  • NumPy 1.26.4 - Numerical computing foundation for vector operations and data processing

Data & HTTP Layer

  • Pydantic 2.11+ - Type-safe data modeling with frozen models for immutability and runtime validation
  • Requests 2.32.5 - HTTP client library for external API communication (security patched)

Reliability & Resilience

  • CircuitBreaker 2.0+ - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
  • AtomicWrites 1.4.1 - Atomic file operations ensuring data consistency and durability
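To make the circuit breaker pattern concrete, here is a small standard-library sketch. It is not the API of the CircuitBreaker package listed above; `SimpleCircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical names and values chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a flaky dependency as `breaker.call(fetch, url)` stops repeated calls to a failing service until the recovery timeout has passed, which is what limits cascading failures.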

🎯 Architecture Pattern

ARF implements a Multi-Agent Orchestration Pattern with three specialized agents:

  • Detective Agent - Anomaly detection
  • Diagnostician Agent - Root cause analysis
  • Predictive Agent - Future risk forecasting

All three agents run in parallel rather than sequentially, for a 3× throughput improvement.
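Running the agents in parallel can be sketched with `asyncio.gather`. The three coroutines below are placeholders that only sleep; their names, return shapes, and the `latency_ms` field are assumptions for illustration, not ARF's real agent interfaces.

```python
import asyncio


async def detective(event: dict) -> dict:
    await asyncio.sleep(0.05)  # stand-in for real anomaly-detection work
    return {"agent": "detective", "anomaly": event["latency_ms"] > 200}


async def diagnostician(event: dict) -> dict:
    await asyncio.sleep(0.05)  # stand-in for root-cause analysis
    return {"agent": "diagnostician", "root_cause": "unknown"}


async def predictive(event: dict) -> dict:
    await asyncio.sleep(0.05)  # stand-in for forecasting
    return {"agent": "predictive", "risk": "low"}


async def analyze(event: dict) -> list:
    # gather() runs the three coroutines concurrently and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(detective(event), diagnostician(event), predictive(event))
```

Calling `asyncio.run(analyze({"latency_ms": 350}))` waits roughly as long as the slowest agent rather than the sum of all three. This holds for work that awaits I/O; CPU-bound agents would need a thread or process pool to see the same benefit.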

⚡ Performance Features

  • Native async handlers (no event loop overhead)
  • Thread-safe single-writer/multi-reader pattern for FAISS
  • RLock-protected policy evaluation
  • Queue-based writes to prevent race conditions
  • ~100 ms p50 latency at 100+ events/second
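The queue-based write path can be pictured as many producers handing items to one dedicated writer thread. The following is a minimal sketch under that assumption; `SingleWriter` is a hypothetical class, and a plain list stands in for the FAISS index.

```python
import queue
import threading


class SingleWriter:
    """Funnel writes from many threads through one queue and one writer thread."""

    _STOP = object()  # sentinel that tells the writer thread to exit

    def __init__(self):
        self.items: list = []  # stand-in for the vector index
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, item) -> None:
        self._queue.put(item)  # safe to call from any thread

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self.items.append(item)  # only this thread ever mutates `items`

    def close(self) -> None:
        # Everything queued before the sentinel is written before the thread exits.
        self._queue.put(self._STOP)
        self._thread.join()
```

Because only the writer thread touches the underlying structure, concurrent producers cannot interleave partial writes.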

The framework combines Gradio for the web/UI layer, FAISS for vector memory, and SentenceTransformers for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.

🧪 The Three Agents

🕵️ Detective Agent (Anomaly Detection)

Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade.

  • Adaptive multi-metric scoring
  • CPU/mem resource anomaly detection
  • Latency & error spike detection
  • Confidence scoring (0–1)
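An adaptive threshold with a 0–1 confidence score can be sketched as a rolling z-score. This is one simple technique, assumed here for illustration; the README does not state which scoring method the Detective Agent uses, and `AdaptiveThreshold`, `window`, and `k` are hypothetical names.

```python
import math
from collections import deque


class AdaptiveThreshold:
    """Flag values that deviate from a rolling baseline by at least `k` standard deviations."""

    def __init__(self, window: int = 50, k: float = 3.0):
        self.values = deque(maxlen=window)  # rolling baseline of recent values
        self.k = k

    def score(self, value: float) -> float:
        """Return a confidence in [0, 1] that `value` is anomalous, then record it."""
        confidence = 0.0
        if len(self.values) >= 5:  # need a few samples before judging
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var) or 1e-9  # avoid division by zero on flat data
            z = abs(value - mean) / std
            confidence = min(z / (2 * self.k), 1.0)
        self.values.append(value)
        return confidence

    def is_anomaly(self, value: float) -> bool:
        return self.score(value) >= 0.5  # equivalent to a z-score of at least k
```

Because the baseline is recomputed from recent values, the threshold follows gradual shifts in normal behavior. As a simplification, anomalous values are also added to the window, so a sustained spike eventually becomes the new baseline.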

🔍 Diagnostician Agent (Root Cause Analysis)

Identifies patterns such as:

  • DB connection pool exhaustion
  • Dependency timeouts
  • Resource saturation
  • App-layer regressions
  • Misconfigurations
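Pattern identification of this kind can be approximated with a small rule table that maps observed metrics to candidate causes. The metric names, thresholds, and the `diagnose` function below are illustrative assumptions, not the Diagnostician Agent's actual rules.

```python
from typing import Callable, List, Tuple

# Each rule pairs a candidate root cause with a predicate over observed metrics.
# Metric names and thresholds are illustrative assumptions.
RULES: List[Tuple[str, Callable[[dict], bool]]] = [
    (
        "db_connection_pool_exhaustion",
        lambda m: m.get("db_pool_in_use", 0) >= m.get("db_pool_size", float("inf")),
    ),
    ("dependency_timeout", lambda m: m.get("upstream_timeout_rate", 0.0) > 0.05),
    (
        "resource_saturation",
        lambda m: m.get("cpu_util", 0.0) > 0.9 or m.get("memory_util", 0.0) > 0.9,
    ),
]


def diagnose(metrics: dict) -> List[str]:
    """Return every candidate cause whose rule matches, in rule order."""
    return [cause for cause, matches in RULES if matches(metrics)]
```

Missing metrics fall back to defaults that never match, so a partial metrics snapshot yields fewer candidates rather than an error.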

🔮 Predictive Agent (Forecasting)

  • 15-minute risk projection
  • Trend analysis
  • Time-to-failure estimates
  • Risk levels: low → critical
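A 15-minute projection and a time-to-failure estimate can both be derived from a straight-line fit over recent samples. The sketch below uses ordinary least squares as a stand-in; the README does not specify the Predictive Agent's forecasting model, and all function names here are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    denom = sum((t - mean_t) ** 2 for t in times)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / denom
    return slope, mean_v - slope * mean_t


def forecast(times, values, horizon_minutes=15.0):
    """Projected value `horizon_minutes` after the last sample."""
    slope, intercept = fit_line(times, values)
    return slope * (times[-1] + horizon_minutes) + intercept


def minutes_to_threshold(times, values, threshold):
    """Minutes from the last sample until the fitted line crosses `threshold`, or None."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None  # a flat or falling trend never reaches the threshold
    remaining = (threshold - intercept) / slope - times[-1]
    return max(remaining, 0.0)
```

With times in minutes and a metric such as CPU utilization, `minutes_to_threshold` gives a rough time-to-failure that could then be bucketed into risk levels. At least two samples with distinct timestamps are required, otherwise the fit divides by zero.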

🚀 Quick Start

1. Clone

git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework

2. Create environment

python3.10 -m venv venv
source venv/bin/activate     # Windows: venv\Scripts\activate

3. Install

pip install -r requirements.txt

4. Start

python app.py

UI: http://localhost:7860

🛠 Configuration

Create .env:

HF_TOKEN=your_token
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
LOG_LEVEL=INFO
HOST=0.0.0.0
PORT=7860

Note: HF_TOKEN is optional and used for downloading SentenceTransformer models from Hugging Face Hub.
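For reference, the variables above could be read into a typed settings object along these lines. This is a sketch, not ARF's own configuration loader: `Settings` and `load_settings` are hypothetical names, the fallback values simply mirror the sample .env, and the code assumes the variables are already present in the process environment (for example, exported by the shell).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    data_dir: str
    index_file: str
    log_level: str
    host: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, falling back to the sample values."""
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # optional; empty string counts as unset
        data_dir=env.get("DATA_DIR", "./data"),
        index_file=env.get("INDEX_FILE", "data/incident_vectors.index"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),  # raises ValueError on a non-numeric port
    )
```

Passing a plain dict to `load_settings` makes the loader easy to exercise in tests without touching the real environment.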

🧩 Custom Healing Policies

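# Assumes HealingPolicy, PolicyCondition, and HealingAction are already imported from the framework.
custom = HealingPolicy(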
    name="custom_latency",
    conditions=[PolicyCondition("latency_p99", "gt", 200)],
    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
    priority=1,
    cool_down_seconds=300,
    max_executions_per_hour=5,
)
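To show how fields like `conditions`, `cool_down_seconds`, and `max_executions_per_hour` might interact, here is a self-contained sketch of policy evaluation. The `Condition` and `Policy` classes are simplified stand-ins written for this example, not the framework's `PolicyCondition` and `HealingPolicy`; in particular, requiring all conditions to match and the exact set of operators are assumptions.

```python
import operator
import time
from dataclasses import dataclass, field
from typing import Optional

# Operator names are assumed; only "gt" appears in the example above.
OPS = {"gt": operator.gt, "lt": operator.lt, "gte": operator.ge, "lte": operator.le, "eq": operator.eq}


@dataclass
class Condition:
    metric: str
    op: str
    threshold: float

    def matches(self, metrics: dict) -> bool:
        value = metrics.get(self.metric)
        return value is not None and OPS[self.op](value, self.threshold)


@dataclass
class Policy:
    name: str
    conditions: list
    actions: list
    cool_down_seconds: float = 300.0
    max_executions_per_hour: int = 5
    _runs: list = field(default_factory=list)  # timestamps of past executions

    def evaluate(self, metrics: dict, now: Optional[float] = None) -> list:
        """Return the actions to run now, or an empty list if the policy should not fire."""
        now = time.monotonic() if now is None else now
        if not all(c.matches(metrics) for c in self.conditions):
            return []
        self._runs = [t for t in self._runs if now - t < 3600]  # keep the last hour only
        if self._runs and now - self._runs[-1] < self.cool_down_seconds:
            return []  # still cooling down from the previous execution
        if len(self._runs) >= self.max_executions_per_hour:
            return []  # hourly execution budget is spent
        self._runs.append(now)
        return list(self.actions)
```

The cool-down and hourly cap keep a persistently failing service from being restarted in a tight loop, which is the usual reason such limits exist on automated healing actions.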

🐳 Docker Deployment

A Dockerfile and a docker-compose.yml are included.

docker-compose up -d

📈 Performance Benchmarks

On Intel i7, 16GB RAM:

| Component | p50 | p99 |
|---|---|---|
| Total End-to-End | ~100ms | ~250ms |
| Policy Engine | 19ms | 38ms |
| Vector Encoding | 15ms | 30ms |

Stable memory: ~250MB
Throughput: 100+ events/sec
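For anyone reproducing these numbers, p50 and p99 can be computed from raw latency samples with the standard library. The helper name `latency_percentiles` is hypothetical, and the choice of the "inclusive" interpolation method is an assumption; the README does not say how the published percentiles were calculated.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return the p50 and p99 of a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1 and 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}
```

A meaningful p99 needs well over a hundred samples; with fewer, the value is dominated by interpolation between the one or two slowest requests.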

🧪 Testing

Production Dependencies

pip install -r requirements.txt

Development Dependencies

pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy

Run Tests

pytest tests/ -v --cov

Coverage: 87%

Includes:

  • Unit tests
  • Thread-safety tests
  • Stress tests
  • Integration tests

Code Quality

# Format code
black .

# Lint code
ruff check .

# Type checking
mypy app.py

🗺 Roadmap

v2.1

  • Distributed FAISS
  • Prometheus / Grafana
  • Slack & PagerDuty integration
  • Custom alerting DSL

v3.0

  • Reinforcement learning for policy optimization
  • LSTM forecasting
  • Dependency graph neural networks

🤝 Contributing

Pull requests welcome.

Please run tests before submitting.

📬 Contact

Author: Juan Petter (LGCY Labs)

⭐ Support

If this project helps you:

  • ⭐ Star the repo
  • 🔄 Share with your network
  • 🐛 Report issues
  • 💡 Suggest features

Built with ❤️ for production reliability