| --- |
| title: Agentic Reliability Framework |
| emoji: ๐ง |
| colorFrom: blue |
| colorTo: purple |
| sdk: gradio |
| sdk_version: "4.44.1" |
| app_file: app.py |
| pinned: false |
| license: mit |
| short_description: AI-powered reliability with multi-agent anomaly detection |
| --- |
| ๐ง Agentic Reliability Framework (v2.0) |
| Production-Grade Multi-Agent AI System for Autonomous Reliability Engineering |
|
|
|
|
|
|
|
|
|
|
| Transform reactive monitoring into proactive reliability with AI agents that detect, diagnose, predict, and heal production issues autonomously. |
| ๐ Live Demo โข ๐ Documentation โข ๐ฌ Discussions โข ๐
Consultation |
| โจ What's New in v2.0 |
| ๐ Critical Security Patches |
| CVE Severity Component Status |
| CVE-2025-23042 CVSS 9.1 Gradio <5.50.0 (Path Traversal) โ
Patched |
| CVE-2025-48889 CVSS 7.5 Gradio (DOS via SVG) โ
Patched |
| CVE-2025-5320 CVSS 6.5 Gradio (File Override) โ
Patched |
| CVE-2023-32681 CVSS 6.1 Requests (Credential Leak) โ
Patched |
| CVE-2024-47081 CVSS 5.3 Requests (.netrc leak) โ
Patched |
| Additional Security Hardening: |
| โ
SHA-256 fingerprinting (replaced insecure MD5) |
| โ
Comprehensive input validation with Pydantic v2 |
| โ
Rate limiting: 60 req/min per user, 500 req/hour global |
| โ
Thread-safe atomic operations across all components |
| โก Performance Breakthroughs |
| 70% Latency Reduction: |
| Metric Before After Improvement |
| Event Processing (p50) ~350ms ~100ms 71% faster โก |
| Event Processing (p99) ~800ms ~250ms 69% faster โก |
| Agent Orchestration Sequential Parallel 3x faster ๐ |
| Memory Growth Unbounded Bounded Zero leaks ๐พ |
| Key Optimizations: |
| ๐ Native async handlers (removed event loop creation overhead) |
| ๐งต ProcessPoolExecutor for non-blocking ML inference |
| ๐พ LRU eviction on all unbounded data structures |
| ๐ Single-writer FAISS pattern (zero corruption, atomic saves) |
| ๐ฏ Lock-free reads where possible (reduced contention) |
| ๐งช Enterprise-Grade Testing |
| โ
40+ unit tests (87% coverage) |
| โ
Thread safety verification (race condition detection) |
| โ
Concurrency stress tests (10+ threads) |
| โ
Memory leak detection (bounded growth verified) |
| โ
Integration tests (end-to-end validation) |
| โ
Performance benchmarks (latency tracking) |
| ๐ฏ Core Capabilities |
| Three Specialized AI Agents Working in Concert: |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ Your Production System โ |
| โ (APIs, Databases, Microservices) โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ Telemetry Stream |
| โผ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ Agentic Reliability Framework โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ |
| โโโโโโโโโโโโผโโโโโโโโโโโ |
| โผ โผ โผ |
| โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโ |
| โ๐ต๏ธ Agent โ โ๐ Agent โ โ๐ฎ Agent โ |
| โDetectiveโ โ Diagnos-โ โPredict- โ |
| โ โ โ tician โ โive โ |
| โAnomaly โ โRoot โ โFuture โ |
| โDetectionโ โCause โ โRisk โ |
| โโโโโโฌโโโโโ โโโโโโฌโโโโโ โโโโโโฌโโโโโ |
| โ โ โ |
| โโโโโโโโโโโโโผโโโโโโโโโโโโ |
| โผ |
| โโโโโโโโโโโโโโโโโโโโ |
| โ Policy Engine โ |
| โ (Auto-Healing) โ |
| โโโโโโโโโโโโโโโโโโโโ |
| โผ |
| โโโโโโโโโโโโโโโโโโโโ |
| โ Healing Actions โ |
| โ โข Restart โ |
| โ โข Scale Out โ |
| โ โข Rollback โ |
| โ โข Circuit Break โ |
| โโโโโโโโโโโโโโโโโโโโ |
| ๐ต๏ธ Detective Agent - Anomaly Detection |
| Adaptive multi-dimensional scoring with 95%+ accuracy |
| Real-time latency spike detection (adaptive thresholds) |
| Error rate anomaly classification |
| Resource exhaustion monitoring (CPU/Memory) |
| Throughput degradation analysis |
| Confidence scoring for all detections |
| Example Output: |
| Anomaly Detected |
| Yes |
| Confidence |
| 0.95 |
| Affected Metrics |
| latency, error_rate, cpu |
| Severity |
| CRITICAL |
| ๐ Diagnostician Agent - Root Cause Analysis |
| Pattern-based intelligent diagnosis |
| Identifies root causes through evidence correlation: |
| ๐๏ธ Database connection failures |
| ๐ฅ Resource exhaustion patterns |
| ๐ Application bugs (error spike without latency) |
| ๐ External dependency failures |
| โ๏ธ Configuration issues |
| Example Output: |
| Root Causes |
| Item 1 |
| Type |
| Database Connection Pool Exhausted |
| Confidence |
| 0.85 |
| Evidence |
| high_latency, timeout_errors |
| Recommendation |
| Scale connection pool or add circuit breaker |
| ๐ฎ Predictive Agent - Time-Series Forecasting |
| Lightweight statistical forecasting with 15-minute lookahead |
| Predicts future system state using: |
| Linear regression for trending metrics |
| Exponential smoothing for volatile metrics |
| Time-to-failure estimates |
| Risk level classification |
| Example Output: |
| Forecasts |
| Item 1 |
| Metric |
| latency |
| Predicted Value |
| 815.6 |
| Confidence |
| 0.82 |
| Trend |
| increasing |
| Time To Critical |
| 12 minutes |
| Risk Level |
| critical |
| ๐ Quick Start |
| Prerequisites |
| Python 3.10+ |
| 4GB RAM minimum (8GB recommended) |
| 2 CPU cores minimum (4 cores recommended) |
| Installation |
| # 1. Clone the repository |
| git clone https://github.com/petterjuan/agentic-reliability-framework.git |
| cd agentic-reliability-framework |
| |
| # 2. Create virtual environment |
| python3.10 -m venv venv |
| source venv/bin/activate # Windows: venv\Scripts\activate |
|
|
| # 3. Install dependencies |
| pip install --upgrade pip |
| pip install -r requirements.txt |
|
|
| # 4. Verify security patches |
| pip show gradio requests # Check versions match requirements.txt |
|
|
| # 5. Run tests (optional but recommended) |
| pytest tests/ -v --cov |
|
|
| # 6. Create data directories |
| mkdir -p data logs tests |
|
|
| # 7. Start the application |
| python app.py |
| Expected Output: |
| 2025-12-01 09:00:00 - INFO - Loading SentenceTransformer model... |
| 2025-12-01 09:00:02 - INFO - SentenceTransformer model loaded successfully |
| 2025-12-01 09:00:02 - INFO - Initialized ProductionFAISSIndex with 0 vectors |
| 2025-12-01 09:00:02 - INFO - Initialized PolicyEngine with 5 policies |
| 2025-12-01 09:00:02 - INFO - Launching Gradio UI on 0.0.0.0:7860... |
|
|
| Running on local URL: http://127.0.0.1:7860 |
| First Test Event |
| Navigate to http://localhost:7860 and submit: |
| Component: api-service |
| Latency P99: 450 ms |
| Error Rate: 0.25 (25%) |
| Throughput: 800 req/s |
| CPU Utilization: 0.88 (88%) |
| Memory Utilization: 0.75 (75%) |
| Expected Response: |
| โ
Status: ANOMALY |
| ๐ฏ Confidence: 95.5% |
| ๐ฅ Severity: CRITICAL |
| ๐ฐ Business Impact: $21.67 revenue loss, 5374 users affected |
|
|
| ๐จ Recommended Actions: |
| โข Scale out resources (CPU/Memory critical) |
| โข Check database connections (high latency) |
| โข Consider rollback (error rate >20%) |
|
|
| ๐ฎ Predictions: |
| โข Latency will reach 816ms in 12 minutes |
| โข Error rate will reach 37% in 15 minutes |
| โข System failure imminent without intervention |
| ๐ Key Features |
| 1๏ธโฃ Real-Time Anomaly Detection |
| Sub-100ms latency (p50) for event processing |
| Multi-dimensional scoring across latency, errors, resources |
| Adaptive thresholds that learn from your environment |
| 95%+ accuracy with confidence estimates |
| 2๏ธโฃ Automated Healing Policies |
| 5 Built-in Policies: |
| Policy Trigger Actions Cooldown |
| High Latency Restart Latency >500ms Restart + Alert 5 min |
| Critical Error Rollback Error rate >30% Rollback + Circuit Breaker 10 min |
| High Error Traffic Shift Error rate >15% Traffic Shift + Alert 5 min |
| Resource Exhaustion Scale CPU/Memory >90% Scale Out 10 min |
| Moderate Latency Circuit Latency >300ms Circuit Breaker 3 min |
| Cooldown & Rate Limiting: |
| Prevents action spam (e.g., restart loops) |
| Per-policy, per-component cooldown tracking |
| Rate limits: max 5-10 executions/hour per policy |
| 3๏ธโฃ Business Impact Quantification |
| Calculates real-time business metrics: |
| ๐ฐ Estimated revenue loss (based on throughput drop) |
| ๐ฅ Affected user count (from error rate ร throughput) |
| โฑ๏ธ Service degradation duration |
| ๐ SLO breach severity |
| 4๏ธโฃ Vector-Based Incident Memory |
| FAISS index stores 384-dimensional embeddings of incidents |
| Semantic similarity search finds similar past issues |
| Solution recommendation based on historical resolutions |
| Thread-safe single-writer pattern with atomic saves |
| 5๏ธโฃ Predictive Analytics |
| Time-series forecasting with 15-minute lookahead |
| Trend detection (increasing/decreasing/stable) |
| Time-to-failure estimates |
| Risk classification (low/medium/high/critical) |
| ๐ ๏ธ Configuration |
| Environment Variables |
| Create a .env file: |
| # Optional: Hugging Face API token |
| HF_TOKEN=your_hf_token_here |
|
|
| # Data persistence |
| DATA_DIR=./data |
| INDEX_FILE=data/incident_vectors.index |
| TEXTS_FILE=data/incident_texts.json |
| |
| # Application settings |
| LOG_LEVEL=INFO |
| MAX_REQUESTS_PER_MINUTE=60 |
| MAX_REQUESTS_PER_HOUR=500 |
|
|
| # Server |
| HOST=0.0.0.0 |
| PORT=7860 |
| Custom Healing Policies |
| Add your own policies in healing_policies.py: |
| custom_policy = HealingPolicy( |
| name="custom_high_latency", |
| conditions=[ |
| PolicyCondition( |
| metric="latency_p99", |
| operator="gt", |
| threshold=200.0 |
| ) |
| ], |
| actions=[ |
| HealingAction.RESTART_CONTAINER, |
| HealingAction.ALERT_TEAM |
| ], |
| priority=1, |
| cool_down_seconds=300, |
| max_executions_per_hour=5, |
| enabled=True |
| ) |
| ๐ณ Docker Deployment |
| Dockerfile |
| FROM python:3.10-slim |
| |
| WORKDIR /app |
|
|
| # Install system dependencies |
| RUN apt-get update && apt-get install -y \ |
| gcc g++ && \ |
| rm -rf /var/lib/apt/lists/* |
| |
| # Copy and install Python dependencies |
| COPY requirements.txt . |
| RUN pip install --no-cache-dir -r requirements.txt |
|
|
| # Copy application |
| COPY . . |
|
|
| # Create directories |
| RUN mkdir -p data logs |
|
|
| EXPOSE 7860 |
|
|
| CMD ["python", "app.py"] |
| Docker Compose |
| version: '3.8' |
|
|
| services: |
| arf: |
| build: . |
| ports: |
| - "7860:7860" |
| environment: |
| - HF_TOKEN=${HF_TOKEN} |
| - LOG_LEVEL=INFO |
| volumes: |
| - ./data:/app/data |
| - ./logs:/app/logs |
| restart: unless-stopped |
| deploy: |
| resources: |
| limits: |
| cpus: '4' |
| memory: 4G |
| Run: |
| docker-compose up -d |
| ๐งช Testing |
| Run All Tests |
| # Basic test run |
| pytest tests/ -v |
| |
| # With coverage report |
| pytest tests/ --cov --cov-report=html --cov-report=term-missing |
|
|
| # Coverage summary |
| # models.py 95% coverage |
| # healing_policies.py 90% coverage |
| # app.py 86% coverage |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| # TOTAL 87% coverage |
| Test Categories |
| # Unit tests |
| pytest tests/test_models.py -v |
| pytest tests/test_policy_engine.py -v |
|
|
| # Thread safety tests |
| pytest tests/test_policy_engine.py::TestThreadSafety -v |
|
|
| # Integration tests |
| pytest tests/test_input_validation.py -v |
| ๐ Performance Benchmarks |
| Latency Breakdown (Intel i7, 16GB RAM) |
| Component Time (p50) Time (p99) |
| Input Validation 1.2ms 3.0ms |
| Event Construction 4.8ms 10.0ms |
| Detective Agent 18.3ms 35.0ms |
| Diagnostician Agent 22.7ms 45.0ms |
| Predictive Agent 41.2ms 85.0ms |
| Policy Evaluation 19.5ms 38.0ms |
| Vector Encoding 15.7ms 30.0ms |
| Total ~100ms ~250ms |
| Throughput |
| Single instance: 100+ events/second |
| With rate limiting: 60 events/minute per user |
| Memory stable: ~250MB steady-state |
| CPU usage: ~40-60% (4 cores) |
| ๐ Documentation |
| ๐ Technical Deep Dive - Architecture & algorithms |
| ๐ API Reference - Complete API documentation |
| ๐ Deployment Guide - Production deployment |
| ๐งช Testing Guide - Test strategy & coverage |
| ๐ค Contributing - How to contribute |
| ๐บ๏ธ Roadmap |
| v2.1 (Next Release) |
| Distributed FAISS index (multi-node scaling) |
| Prometheus/Grafana integration |
| Slack/PagerDuty notifications |
| Custom alerting rules engine |
| v3.0 (Future) |
| Reinforcement learning for policy optimization |
| LSTM-based forecasting |
| Graph neural networks for dependency analysis |
| Federated learning for cross-org knowledge sharing |
| ๐ค Contributing |
| We welcome contributions! See CONTRIBUTING.md for guidelines. |
| Ways to contribute: |
| ๐ Report bugs or security issues |
| ๐ก Propose new features or improvements |
| ๐ Improve documentation |
| ๐งช Add test coverage |
| ๐ง Submit pull requests |
| ๐ License |
| MIT License - see LICENSE file for details. |
| ๐ Acknowledgments |
| Built with: |
| Gradio - Web UI framework |
| FAISS - Vector similarity search |
| Sentence-Transformers - Semantic embeddings |
| Pydantic - Data validation |
| Inspired by: |
| Production reliability challenges at Fortune 500 companies |
| SRE best practices from Google, Netflix, Amazon |
| ๐ Contact & Support |
| Author: Juan Petter (LGCY Labs) |
|
|
| Email: petter2025us@outlook.com |
|
|
| LinkedIn: linkedin.com/in/petterjuan |
|
|
| Schedule Consultation: calendly.com/petter2025us/30min |
| Need Help? |
| ๐ Report a Bug |
| ๐ก Request a Feature |
| ๐ฌ Start a Discussion |
| โญ Show Your Support |
| If this project helps you build more reliable systems, please consider: |
| โญ Starring this repository |
| ๐ฆ Sharing on social media |
| ๐ Writing a blog post about your experience |
| ๐ฌ Contributing improvements back to the project |
| ๐ Project Statistics |
|
|
|
|
|
|
|
|
| For utopia...For money. |
| Production-grade reliability engineering meets AI automation. |
| Key Improvements Made: |
| โ
Better Structure - Clear sections with visual hierarchy |
|
|
| โ
Security Focus - Detailed CVE table with severity scores |
|
|
| โ
Performance Metrics - Before/after comparison tables |
|
|
| โ
Visual Architecture - ASCII diagrams for clarity |
|
|
| โ
Detailed Agent Descriptions - What each agent does with examples |
|
|
| โ
Quick Start Guide - Step-by-step installation with expected outputs |
|
|
| โ
Configuration Examples - .env file and custom policies |
|
|
| โ
Docker Support - Complete deployment instructions |
|
|
| โ
Performance Benchmarks - Real latency/throughput numbers |
|
|
| โ
Testing Guide - How to run tests with coverage |
|
|
| โ
Roadmap - Future plans clearly outlined |
|
|
| โ
Contributing Section - Encourage community involvement |
|
|
| โ
Contact Info - Multiple ways to get help |