Update README.md
README.md
CHANGED
|
@@ -1,453 +1,109 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
emoji: 🧠
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
-
license:
|
| 11 |
-
short_description:
|
| 12 |
---
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
│ Policy Engine │
|
| 81 |
-
│ (Auto-Healing) │
|
| 82 |
-
└──────────────────┘
|
| 83 |
-
▼
|
| 84 |
-
┌──────────────────┐
|
| 85 |
-
│ Healing Actions │
|
| 86 |
-
│ • Restart │
|
| 87 |
-
│ • Scale Out │
|
| 88 |
-
│ • Rollback │
|
| 89 |
-
│ • Circuit Break │
|
| 90 |
-
└──────────────────┘
|
| 91 |
-
🕵️ Detective Agent - Anomaly Detection
|
| 92 |
-
Adaptive multi-dimensional scoring with 95%+ accuracy
|
| 93 |
-
Real-time latency spike detection (adaptive thresholds)
|
| 94 |
-
Error rate anomaly classification
|
| 95 |
-
Resource exhaustion monitoring (CPU/Memory)
|
| 96 |
-
Throughput degradation analysis
|
| 97 |
-
Confidence scoring for all detections
|
| 98 |
-
Example Output:
|
| 99 |
-
Anomaly Detected
|
| 100 |
-
Yes
|
| 101 |
-
Confidence
|
| 102 |
-
0.95
|
| 103 |
-
Affected Metrics
|
| 104 |
-
latency, error_rate, cpu
|
| 105 |
-
Severity
|
| 106 |
-
CRITICAL
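The adaptive-threshold idea listed above can be sketched in a few lines. This is an illustration only, not the framework's implementation: the `AdaptiveThreshold` class, the window size, and the `k` multiplier are all assumptions. It flags a sample when it exceeds the rolling mean plus `k` standard deviations of recent samples.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Hypothetical rolling-window detector (not part of ARF's API)."""

    def __init__(self, window: int = 20, k: float = 3.0, min_samples: int = 5):
        self.samples = deque(maxlen=window)  # most recent observations
        self.k = k                           # how many std-devs count as a spike
        self.min_samples = min_samples       # warm-up before any flagging

    def is_anomalous(self, value: float) -> bool:
        """Compare value to mean + k * stddev of past samples, then record it."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            threshold = mean(self.samples) + self.k * pstdev(self.samples)
            anomalous = value > threshold
        # Every sample is recorded, so the threshold drifts with the workload.
        self.samples.append(value)
        return anomalous


detector = AdaptiveThreshold()
baseline = [100, 102, 98, 101, 99, 103, 97, 100]  # latency in ms
flags = [detector.is_anomalous(v) for v in baseline]
spike = detector.is_anomalous(450)
```

A real detector would likely exclude flagged samples from the window so that a sustained incident does not raise its own threshold.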
|
| 107 |
-
🔍 Diagnostician Agent - Root Cause Analysis
|
| 108 |
-
Pattern-based intelligent diagnosis
|
| 109 |
-
Identifies root causes through evidence correlation:
|
| 110 |
-
🗄️ Database connection failures
|
| 111 |
-
🔥 Resource exhaustion patterns
|
| 112 |
-
🐛 Application bugs (error spike without latency)
|
| 113 |
-
🌐 External dependency failures
|
| 114 |
-
⚙️ Configuration issues
|
| 115 |
-
Example Output:
|
| 116 |
-
Root Causes
|
| 117 |
-
Item 1
|
| 118 |
-
Type
|
| 119 |
-
Database Connection Pool Exhausted
|
| 120 |
-
Confidence
|
| 121 |
-
0.85
|
| 122 |
-
Evidence
|
| 123 |
-
high_latency, timeout_errors
|
| 124 |
-
Recommendation
|
| 125 |
-
Scale connection pool or add circuit breaker
|
| 126 |
-
🔮 Predictive Agent - Time-Series Forecasting
|
| 127 |
-
Lightweight statistical forecasting with 15-minute lookahead
|
| 128 |
-
Predicts future system state using:
|
| 129 |
-
Linear regression for trending metrics
|
| 130 |
-
Exponential smoothing for volatile metrics
|
| 131 |
-
Time-to-failure estimates
|
| 132 |
-
Risk level classification
|
| 133 |
-
Example Output:
|
| 134 |
-
Forecasts
|
| 135 |
-
Item 1
|
| 136 |
-
Metric
|
| 137 |
-
latency
|
| 138 |
-
Predicted Value
|
| 139 |
-
815.6
|
| 140 |
-
Confidence
|
| 141 |
-
0.82
|
| 142 |
-
Trend
|
| 143 |
-
increasing
|
| 144 |
-
Time To Critical
|
| 145 |
-
12 minutes
|
| 146 |
-
Risk Level
|
| 147 |
-
critical
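The two forecasting techniques named above can be illustrated with standard-library Python. This is a sketch under assumptions: the function names, the one-step-per-sample time axis, and the default `alpha` are hypothetical and are not taken from the framework.

```python
def linear_forecast(values: list[float], steps_ahead: int) -> float:
    """Fit y = a + b*t by least squares and extrapolate past the last sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return intercept + slope * (n - 1 + steps_ahead)


def exp_smooth(values: list[float], alpha: float = 0.5) -> float:
    """Simple exponential smoothing; returns the final smoothed level."""
    if not values:
        raise ValueError("need at least one sample")
    level = values[0]
    for v in values[1:]:
        # Higher alpha reacts faster to recent samples.
        level = alpha * v + (1 - alpha) * level
    return level


trend = linear_forecast([100.0, 110.0, 120.0, 130.0], steps_ahead=2)
smoothed = exp_smooth([10.0, 20.0], alpha=0.5)
```

With one sample per minute, `steps_ahead=15` would correspond to the 15-minute lookahead described above.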
|
| 148 |
-
🚀 Quick Start
|
| 149 |
-
Prerequisites
|
| 150 |
-
Python 3.10+
|
| 151 |
-
4GB RAM minimum (8GB recommended)
|
| 152 |
-
2 CPU cores minimum (4 cores recommended)
|
| 153 |
-
Installation
|
| 154 |
-
# 1. Clone the repository
|
| 155 |
-
git clone https://github.com/petterjuan/agentic-reliability-framework.git
|
| 156 |
cd agentic-reliability-framework
|
| 157 |
|
| 158 |
-
|
| 159 |
-
python3.10 -m venv venv
|
| 160 |
-
source venv/bin/activate # Windows: venv\Scripts\activate
|
| 161 |
-
|
| 162 |
-
# 3. Install dependencies
|
| 163 |
-
pip install --upgrade pip
|
| 164 |
-
pip install -r requirements.txt
|
| 165 |
-
|
| 166 |
-
# 4. Verify security patches
|
| 167 |
-
pip show gradio requests # Check versions match requirements.txt
|
| 168 |
-
|
| 169 |
-
# 5. Run tests (optional but recommended)
|
| 170 |
-
pytest tests/ -v --cov
|
| 171 |
-
|
| 172 |
-
# 6. Create data directories
|
| 173 |
-
mkdir -p data logs tests
|
| 174 |
-
|
| 175 |
-
# 7. Start the application
|
| 176 |
-
python app.py
|
| 177 |
-
Expected Output:
|
| 178 |
-
2025-12-01 09:00:00 - INFO - Loading SentenceTransformer model...
|
| 179 |
-
2025-12-01 09:00:02 - INFO - SentenceTransformer model loaded successfully
|
| 180 |
-
2025-12-01 09:00:02 - INFO - Initialized ProductionFAISSIndex with 0 vectors
|
| 181 |
-
2025-12-01 09:00:02 - INFO - Initialized PolicyEngine with 5 policies
|
| 182 |
-
2025-12-01 09:00:02 - INFO - Launching Gradio UI on 0.0.0.0:7860...
|
| 183 |
-
|
| 184 |
-
Running on local URL: http://127.0.0.1:7860
|
| 185 |
-
First Test Event
|
| 186 |
-
Navigate to http://localhost:7860 and submit:
|
| 187 |
-
Component: api-service
|
| 188 |
-
Latency P99: 450 ms
|
| 189 |
-
Error Rate: 0.25 (25%)
|
| 190 |
-
Throughput: 800 req/s
|
| 191 |
-
CPU Utilization: 0.88 (88%)
|
| 192 |
-
Memory Utilization: 0.75 (75%)
|
| 193 |
-
Expected Response:
|
| 194 |
-
✅ Status: ANOMALY
|
| 195 |
-
🎯 Confidence: 95.5%
|
| 196 |
-
🔥 Severity: CRITICAL
|
| 197 |
-
💰 Business Impact: $21.67 revenue loss, 5374 users affected
|
| 198 |
-
|
| 199 |
-
🚨 Recommended Actions:
|
| 200 |
-
• Scale out resources (CPU/Memory critical)
|
| 201 |
-
• Check database connections (high latency)
|
| 202 |
-
• Consider rollback (error rate >20%)
|
| 203 |
-
|
| 204 |
-
🔮 Predictions:
|
| 205 |
-
• Latency will reach 816ms in 12 minutes
|
| 206 |
-
• Error rate will reach 37% in 15 minutes
|
| 207 |
-
• System failure imminent without intervention
|
| 208 |
-
📊 Key Features
|
| 209 |
-
1️⃣ Real-Time Anomaly Detection
|
| 210 |
-
Sub-100ms latency (p50) for event processing
|
| 211 |
-
Multi-dimensional scoring across latency, errors, resources
|
| 212 |
-
Adaptive thresholds that learn from your environment
|
| 213 |
-
95%+ accuracy with confidence estimates
|
| 214 |
-
2️⃣ Automated Healing Policies
|
| 215 |
-
5 Built-in Policies:
|
| 216 |
-
Policy | Trigger | Actions | Cooldown
|
| 217 |
-
High Latency Restart | Latency >500ms | Restart + Alert | 5 min
|
| 218 |
-
Critical Error Rollback | Error rate >30% | Rollback + Circuit Breaker | 10 min
|
| 219 |
-
High Error Traffic Shift | Error rate >15% | Traffic Shift + Alert | 5 min
|
| 220 |
-
Resource Exhaustion Scale | CPU/Memory >90% | Scale Out | 10 min
|
| 221 |
-
Moderate Latency Circuit | Latency >300ms | Circuit Breaker | 3 min
|
| 222 |
-
Cooldown & Rate Limiting:
|
| 223 |
-
Prevents action spam (e.g., restart loops)
|
| 224 |
-
Per-policy, per-component cooldown tracking
|
| 225 |
-
Rate limits: max 5-10 executions/hour per policy
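The cooldown and rate-limit behaviour described above could look roughly like this. It is a sketch, not the framework's policy engine: the `CooldownTracker` class and its method are hypothetical, and only the parameter names `cool_down_seconds` and `max_executions_per_hour` echo the policy fields shown elsewhere in this README.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class CooldownTracker:
    """Hypothetical per-policy, per-component cooldown and hourly rate limiter."""

    def __init__(self, cool_down_seconds: float = 300, max_executions_per_hour: int = 5):
        self.cool_down_seconds = cool_down_seconds
        self.max_per_hour = max_executions_per_hour
        # (policy, component) -> timestamps of executions in the last hour
        self._history = defaultdict(deque)

    def try_execute(self, policy: str, component: str, now: Optional[float] = None) -> bool:
        """Return True and record the execution if it is currently allowed."""
        now = time.monotonic() if now is None else now
        history = self._history[(policy, component)]
        # Drop executions older than one hour.
        while history and now - history[0] >= 3600:
            history.popleft()
        if history and now - history[-1] < self.cool_down_seconds:
            return False  # still cooling down
        if len(history) >= self.max_per_hour:
            return False  # hourly budget exhausted
        history.append(now)
        return True


tracker = CooldownTracker(cool_down_seconds=300, max_executions_per_hour=2)
results = [
    tracker.try_execute("restart", "api-service", now=0),     # first run
    tracker.try_execute("restart", "api-service", now=100),   # inside cooldown
    tracker.try_execute("restart", "api-service", now=300),   # cooldown elapsed
    tracker.try_execute("restart", "api-service", now=700),   # hourly limit hit
    tracker.try_execute("restart", "db-service", now=100),    # separate component
    tracker.try_execute("restart", "api-service", now=3600),  # oldest run expired
]
```

Keying the history on the `(policy, component)` pair is what stops a restart loop on one service from blocking the same policy on another.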
|
| 226 |
-
3️⃣ Business Impact Quantification
|
| 227 |
-
Calculates real-time business metrics:
|
| 228 |
-
💰 Estimated revenue loss (based on throughput drop)
|
| 229 |
-
👥 Affected user count (from error rate × throughput)
|
| 230 |
-
⏱️ Service degradation duration
|
| 231 |
-
📉 SLO breach severity
|
| 232 |
-
4️⃣ Vector-Based Incident Memory
|
| 233 |
-
FAISS index stores 384-dimensional embeddings of incidents
|
| 234 |
-
Semantic similarity search finds similar past issues
|
| 235 |
-
Solution recommendation based on historical resolutions
|
| 236 |
-
Thread-safe single-writer pattern with atomic saves
|
| 237 |
-
5️⃣ Predictive Analytics
|
| 238 |
-
Time-series forecasting with 15-minute lookahead
|
| 239 |
-
Trend detection (increasing/decreasing/stable)
|
| 240 |
-
Time-to-failure estimates
|
| 241 |
-
Risk classification (low/medium/high/critical)
|
| 242 |
-
🛠️ Configuration
|
| 243 |
-
Environment Variables
|
| 244 |
-
Create a .env file:
|
| 245 |
-
# Optional: Hugging Face API token
|
| 246 |
-
HF_TOKEN=your_hf_token_here
|
| 247 |
-
|
| 248 |
-
# Data persistence
|
| 249 |
-
DATA_DIR=./data
|
| 250 |
-
INDEX_FILE=data/incident_vectors.index
|
| 251 |
-
TEXTS_FILE=data/incident_texts.json
|
| 252 |
-
|
| 253 |
-
# Application settings
|
| 254 |
-
LOG_LEVEL=INFO
|
| 255 |
-
MAX_REQUESTS_PER_MINUTE=60
|
| 256 |
-
MAX_REQUESTS_PER_HOUR=500
|
| 257 |
-
|
| 258 |
-
# Server
|
| 259 |
-
HOST=0.0.0.0
|
| 260 |
-
PORT=7860
|
| 261 |
-
Custom Healing Policies
|
| 262 |
-
Add your own policies in healing_policies.py:
|
| 263 |
-
custom_policy = HealingPolicy(
|
| 264 |
-
name="custom_high_latency",
|
| 265 |
-
conditions=[
|
| 266 |
-
PolicyCondition(
|
| 267 |
-
metric="latency_p99",
|
| 268 |
-
operator="gt",
|
| 269 |
-
threshold=200.0
|
| 270 |
-
)
|
| 271 |
-
],
|
| 272 |
-
actions=[
|
| 273 |
-
HealingAction.RESTART_CONTAINER,
|
| 274 |
-
HealingAction.ALERT_TEAM
|
| 275 |
-
],
|
| 276 |
-
priority=1,
|
| 277 |
-
cool_down_seconds=300,
|
| 278 |
-
max_executions_per_hour=5,
|
| 279 |
-
enabled=True
|
| 280 |
-
)
|
| 281 |
-
🐳 Docker Deployment
|
| 282 |
-
Dockerfile
|
| 283 |
-
FROM python:3.10-slim
|
| 284 |
-
|
| 285 |
-
WORKDIR /app
|
| 286 |
-
|
| 287 |
-
# Install system dependencies
|
| 288 |
-
RUN apt-get update && apt-get install -y \
|
| 289 |
-
gcc g++ && \
|
| 290 |
-
rm -rf /var/lib/apt/lists/*
|
| 291 |
-
|
| 292 |
-
# Copy and install Python dependencies
|
| 293 |
-
COPY requirements.txt .
|
| 294 |
-
RUN pip install --no-cache-dir -r requirements.txt
|
| 295 |
-
|
| 296 |
-
# Copy application
|
| 297 |
-
COPY . .
|
| 298 |
-
|
| 299 |
-
# Create directories
|
| 300 |
-
RUN mkdir -p data logs
|
| 301 |
-
|
| 302 |
-
EXPOSE 7860
|
| 303 |
-
|
| 304 |
-
CMD ["python", "app.py"]
|
| 305 |
-
Docker Compose
|
| 306 |
-
version: '3.8'
|
| 307 |
-
|
| 308 |
-
services:
|
| 309 |
-
arf:
|
| 310 |
-
build: .
|
| 311 |
-
ports:
|
| 312 |
-
- "7860:7860"
|
| 313 |
-
environment:
|
| 314 |
-
- HF_TOKEN=${HF_TOKEN}
|
| 315 |
-
- LOG_LEVEL=INFO
|
| 316 |
-
volumes:
|
| 317 |
-
- ./data:/app/data
|
| 318 |
-
- ./logs:/app/logs
|
| 319 |
-
restart: unless-stopped
|
| 320 |
-
deploy:
|
| 321 |
-
resources:
|
| 322 |
-
limits:
|
| 323 |
-
cpus: '4'
|
| 324 |
-
memory: 4G
|
| 325 |
-
Run:
|
| 326 |
-
docker-compose up -d
|
| 327 |
-
🧪 Testing
|
| 328 |
-
Run All Tests
|
| 329 |
-
# Basic test run
|
| 330 |
-
pytest tests/ -v
|
| 331 |
-
|
| 332 |
-
# With coverage report
|
| 333 |
-
pytest tests/ --cov --cov-report=html --cov-report=term-missing
|
| 334 |
-
|
| 335 |
-
# Coverage summary
|
| 336 |
-
# models.py 95% coverage
|
| 337 |
-
# healing_policies.py 90% coverage
|
| 338 |
-
# app.py 86% coverage
|
| 339 |
-
# ──────────────────────────────────────
|
| 340 |
-
# TOTAL 87% coverage
|
| 341 |
-
Test Categories
|
| 342 |
-
# Unit tests
|
| 343 |
-
pytest tests/test_models.py -v
|
| 344 |
-
pytest tests/test_policy_engine.py -v
|
| 345 |
-
|
| 346 |
-
# Thread safety tests
|
| 347 |
-
pytest tests/test_policy_engine.py::TestThreadSafety -v
|
| 348 |
-
|
| 349 |
-
# Integration tests
|
| 350 |
-
pytest tests/test_input_validation.py -v
|
| 351 |
-
📈 Performance Benchmarks
|
| 352 |
-
Latency Breakdown (Intel i7, 16GB RAM)
|
| 353 |
-
Component | Time (p50) | Time (p99)
|
| 354 |
-
Input Validation | 1.2ms | 3.0ms
|
| 355 |
-
Event Construction | 4.8ms | 10.0ms
|
| 356 |
-
Detective Agent | 18.3ms | 35.0ms
|
| 357 |
-
Diagnostician Agent | 22.7ms | 45.0ms
|
| 358 |
-
Predictive Agent | 41.2ms | 85.0ms
|
| 359 |
-
Policy Evaluation | 19.5ms | 38.0ms
|
| 360 |
-
Vector Encoding | 15.7ms | 30.0ms
|
| 361 |
-
Total | ~100ms | ~250ms
|
| 362 |
-
Throughput
|
| 363 |
-
Single instance: 100+ events/second
|
| 364 |
-
With rate limiting: 60 events/minute per user
|
| 365 |
-
Memory stable: ~250MB steady-state
|
| 366 |
-
CPU usage: ~40-60% (4 cores)
|
| 367 |
-
📚 Documentation
|
| 368 |
-
📖 Technical Deep Dive - Architecture & algorithms
|
| 369 |
-
🔌 API Reference - Complete API documentation
|
| 370 |
-
🚀 Deployment Guide - Production deployment
|
| 371 |
-
🧪 Testing Guide - Test strategy & coverage
|
| 372 |
-
🤝 Contributing - How to contribute
|
| 373 |
-
🗺️ Roadmap
|
| 374 |
-
v2.1 (Next Release)
|
| 375 |
-
Distributed FAISS index (multi-node scaling)
|
| 376 |
-
Prometheus/Grafana integration
|
| 377 |
-
Slack/PagerDuty notifications
|
| 378 |
-
Custom alerting rules engine
|
| 379 |
-
v3.0 (Future)
|
| 380 |
-
Reinforcement learning for policy optimization
|
| 381 |
-
LSTM-based forecasting
|
| 382 |
-
Graph neural networks for dependency analysis
|
| 383 |
-
Federated learning for cross-org knowledge sharing
|
| 384 |
-
🤝 Contributing
|
| 385 |
-
We welcome contributions! See CONTRIBUTING.md for guidelines.
|
| 386 |
-
Ways to contribute:
|
| 387 |
-
🐛 Report bugs or security issues
|
| 388 |
-
💡 Propose new features or improvements
|
| 389 |
-
📝 Improve documentation
|
| 390 |
-
🧪 Add test coverage
|
| 391 |
-
🔧 Submit pull requests
|
| 392 |
-
📄 License
|
| 393 |
-
MIT License - see LICENSE file for details.
|
| 394 |
-
🙏 Acknowledgments
|
| 395 |
-
Built with:
|
| 396 |
-
Gradio - Web UI framework
|
| 397 |
-
FAISS - Vector similarity search
|
| 398 |
-
Sentence-Transformers - Semantic embeddings
|
| 399 |
-
Pydantic - Data validation
|
| 400 |
-
Inspired by:
|
| 401 |
-
Production reliability challenges at Fortune 500 companies
|
| 402 |
-
SRE best practices from Google, Netflix, Amazon
|
| 403 |
-
📞 Contact & Support
|
| 404 |
-
Author: Juan Petter (LGCY Labs)
|
| 405 |
-
|
| 406 |
-
Email: petter2025us@outlook.com
|
| 407 |
-
|
| 408 |
-
LinkedIn: linkedin.com/in/petterjuan
|
| 409 |
-
|
| 410 |
-
Schedule Consultation: calendly.com/petter2025us/30min
|
| 411 |
-
Need Help?
|
| 412 |
-
🐛 Report a Bug
|
| 413 |
-
💡 Request a Feature
|
| 414 |
-
💬 Start a Discussion
|
| 415 |
-
⭐ Show Your Support
|
| 416 |
-
If this project helps you build more reliable systems, please consider:
|
| 417 |
-
⭐ Starring this repository
|
| 418 |
-
🐦 Sharing on social media
|
| 419 |
-
📝 Writing a blog post about your experience
|
| 420 |
-
💬 Contributing improvements back to the project
|
| 421 |
-
📊 Project Statistics
|
| 422 |
-
|
| 423 |
-
|
| 424 |
-
|
| 425 |
-
|
| 426 |
-
For utopia...For money.
|
| 427 |
-
Production-grade reliability engineering meets AI automation.
|
| 428 |
-
Key Improvements Made:
|
| 429 |
-
✅ Better Structure - Clear sections with visual hierarchy
|
| 430 |
-
|
| 431 |
-
✅ Security Focus - Detailed CVE table with severity scores
|
| 432 |
-
|
| 433 |
-
✅ Performance Metrics - Before/after comparison tables
|
| 434 |
-
|
| 435 |
-
✅ Visual Architecture - ASCII diagrams for clarity
|
| 436 |
-
|
| 437 |
-
✅ Detailed Agent Descriptions - What each agent does with examples
|
| 438 |
-
|
| 439 |
-
✅ Quick Start Guide - Step-by-step installation with expected outputs
|
| 440 |
-
|
| 441 |
-
✅ Configuration Examples - .env file and custom policies
|
| 442 |
-
|
| 443 |
-
✅ Docker Support - Complete deployment instructions
|
| 444 |
-
|
| 445 |
-
✅ Performance Benchmarks - Real latency/throughput numbers
|
| 446 |
-
|
| 447 |
-
✅ Testing Guide - How to run tests with coverage
|
| 448 |
-
|
| 449 |
-
✅ Roadmap - Future plans clearly outlined
|
| 450 |
-
|
| 451 |
-
✅ Contributing Section - Encourage community involvement
|
| 452 |
-
|
| 453 |
-
✅ Contact Info - Multiple ways to get help
|
|
| 1 |
---
|
| 2 |
+
title: ARF v4 – Reliability Lab
|
| 3 |
emoji: 🧠
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 4.44.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: apache-2.0
|
| 11 |
+
short_description: ARF v4 – Bayesian reliability demo
|
| 12 |
---
|
| 13 |
+
|
| 14 |
+
# 🧠 ARF v4 – Reliability Lab
|
| 15 |
+
|
| 16 |
+
This Space hosts a live, interactive demo of the **Agentic Reliability Framework v4 (OSS edition)**. It showcases the core intelligence engine – a hybrid Bayesian + Hamiltonian Monte Carlo (HMC) system that evaluates infrastructure incidents and produces advisory healing recommendations.
|
| 17 |
+
|
| 18 |
+
**All outputs are advisory only – no execution.**
|
| 19 |
+
|
| 20 |
+
[GitHub Repository](https://github.com/petter2025us/agentic-reliability-framework)
|
| 21 |
+
[Full Tutorial](https://github.com/petter2025us/agentic-reliability-framework/blob/main/TUTORIAL.md)
|
| 22 |
+
[Enterprise Inquiries](mailto:petter2025us@outlook.com)
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## 🚀 How It Works
|
| 27 |
+
|
| 28 |
+
The demo uses the `EnhancedReliabilityEngine` from the ARF v4 package. When you submit telemetry (component, latency, error rate, etc.), the engine:
|
| 29 |
+
|
| 30 |
+
1. **Runs three specialised agents** in parallel:
|
| 31 |
+
- **Detective** – anomaly detection and pattern recognition
|
| 32 |
+
- **Diagnostician** – root cause analysis
|
| 33 |
+
- **Predictive** – forecasting and trend detection
|
| 34 |
+
|
| 35 |
+
2. **Computes a risk score** using:
|
| 36 |
+
- Online **Bayesian conjugate priors** (Beta‑Binomial) per action category
|
| 37 |
+
- Offline **Hamiltonian Monte Carlo (HMC)** with NUTS (if trained) for complex patterns
|
| 38 |
+
- A weighted blend of both for the final score
|
| 39 |
+
|
| 40 |
+
3. **Applies deterministic policy thresholds** (DPT) to recommend: `APPROVE`, `DENY`, or `ESCALATE`.
|
| 41 |
+
|
| 42 |
+
4. **Returns a JSON object** containing the risk score, agent insights, healing actions, and (if configured) a Claude‑generated executive summary.
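
As a rough illustration of step 2, the sketch below shows a Beta‑Binomial conjugate update and a weighted blend with an optional offline estimate. It is not the `EnhancedReliabilityEngine` code: the `BetaPrior` class, the `blended_risk` function, the reading of risk as a failure probability, and the default weight are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BetaPrior:
    """Beta(alpha, beta) belief over the failure probability of one action category."""
    alpha: float = 1.0  # pseudo-count of failures (uniform prior by default)
    beta: float = 1.0   # pseudo-count of successes

    def update(self, failures: int, successes: int) -> None:
        # Conjugacy: a Beta prior with a Binomial likelihood yields a Beta posterior,
        # so the update is just adding the observed counts.
        self.alpha += failures
        self.beta += successes

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def blended_risk(online: float, offline: Optional[float], weight: float = 0.7) -> float:
    """Blend the online estimate with an offline one; fall back to online alone."""
    if offline is None:  # e.g. no offline model has been trained
        return online
    return weight * online + (1 - weight) * offline


prior = BetaPrior()
prior.update(failures=2, successes=16)  # posterior is Beta(3, 17)
online_only = blended_risk(prior.mean, None)
blended = blended_risk(0.2, 0.4, weight=0.5)
```

The online update is cheap enough to run per request, which is presumably why the heavier HMC model is described as an offline, optional component.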
|
| 43 |
+
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## 🧪 Try It Yourself
|
| 47 |
+
|
| 48 |
+
Fill in the form on the left and click **Analyze**. The output appears as formatted JSON.
|
| 49 |
+
|
| 50 |
+
Example input:
|
| 51 |
+
- **Component**: `api-service`
|
| 52 |
+
- **Latency P99**: `250 ms`
|
| 53 |
+
- **Error Rate**: `0.08`
|
| 54 |
+
- **Throughput**: `1000 req/s`
|
| 55 |
+
- **CPU Utilization**: `0.7`
|
| 56 |
+
- **Memory Utilization**: `0.6`
|
| 57 |
+
|
| 58 |
+
Expected risk score: ~0.12 (low) → `APPROVE`.
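
The mapping from a risk score to a recommendation can be pictured as a pair of fixed cut-offs. The sketch below is hypothetical: the `recommend` function and both threshold values are placeholders chosen for illustration, and the actual thresholds used by ARF are not stated here.

```python
def recommend(risk: float, approve_below: float = 0.2, deny_above: float = 0.8) -> str:
    """Map a risk score in [0, 1] to APPROVE, DENY, or ESCALATE (assumed cut-offs)."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be between 0 and 1")
    if risk < approve_below:
        return "APPROVE"
    if risk > deny_above:
        return "DENY"
    # Anything in between is left for a human to decide.
    return "ESCALATE"


decisions = [recommend(0.12), recommend(0.5), recommend(0.9)]
```

Because the rule is deterministic, the same score always produces the same recommendation, which makes the advisory output easy to audit.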
|
| 59 |
+
|
| 60 |
+
---
|
| 61 |
+
|
| 62 |
+
## 📦 How This Space Is Built
|
| 63 |
+
|
| 64 |
+
- **Base image**: `python:3.10` (via Dockerfile)
|
| 65 |
+
- **Dependencies**:
|
| 66 |
+
- `git+https://github.com/petter2025us/agentic-reliability-framework.git@v4.0.0`
|
| 67 |
+
- `gradio>=4.0.0`
|
| 68 |
+
- **Source code**: a minimal `app.py` that imports and runs `EnhancedReliabilityEngine`.
|
| 69 |
+
|
| 70 |
+
All code is open source and available in the [main repository](https://github.com/petter2025us/agentic-reliability-framework).
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
## 🏃 Run Locally
|
| 75 |
+
|
| 76 |
+
You can run the exact same demo on your own machine:
|
| 77 |
+
|
| 78 |
+
```bash
|
| 79 |
+
git clone https://github.com/petter2025us/agentic-reliability-framework.git
|
| 80 |
cd agentic-reliability-framework
|
| 81 |
+
python -m venv venv
|
| 82 |
+
source venv/bin/activate
|
| 83 |
+
pip install -e .
|
| 84 |
+
pip install gradio
|
| 85 |
+
python examples/app.py # or copy the Space's app.py
|
| 86 |
+
```
|
| 87 |
+
Then open `http://localhost:7860`.
|
| 88 |
+
|
| 89 |
+
---
|
| 90 |
+
|
| 91 |
+
## 📚 Learn More
|
| 92 |
+
|
| 93 |
+
- 📘 [Full Tutorial](https://github.com/petter2025us/agentic-reliability-framework/blob/main/TUTORIAL.md)
|
| 94 |
+
- 🐙 [GitHub Repository](https://github.com/petter2025us/agentic-reliability-framework)
|
| 95 |
+
- 📖 [Contributing Guidelines](https://github.com/petter2025us/agentic-reliability-framework/blob/main/CONTRIBUTING.md)
|
| 96 |
+
- 💼 [Enterprise Inquiries](mailto:petter2025us@outlook.com)
|
| 97 |
+
|
| 98 |
+
---
|
| 99 |
+
|
| 100 |
+
## 📬 Contact
|
| 101 |
+
|
| 102 |
+
- **Email**: [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
|
| 103 |
+
- **LinkedIn**: [petterjuan](https://linkedin.com/in/petterjuan)
|
| 104 |
+
- **Book a call**: [Calendly – 30 min](https://calendly.com/petter2025us/30min)
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
|
| 109 |
+
*Powered by ARF v4 – Bayesian reliability for autonomous infrastructure.*
|