petter2025 committed on
Commit 0dd1bfe · verified · 1 Parent(s): f97adbe

Update README.md

Files changed (1)
  1. README.md +35 -114
README.md CHANGED
@@ -10,132 +10,53 @@ pinned: false
  license: mit
  short_description: AI-powered reliability with multi-agent anomaly detection
  ---
 
-
- # 🧠 Agentic Reliability Framework
-
- **AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**
-
- ## 🚀 Live Demo
-
- **Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real-time.
-
- ## 🎯 What It Does
-
- This framework transforms traditional monitoring into **autonomous reliability engineering**:
-
- - **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
- - **🔧 Automated Healing**: Policy-based auto-remediation for common failures
- - **💰 Business Impact**: Real-time revenue and user impact calculations
- - **📚 Learning System**: FAISS-powered memory learns from every incident
- - **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features
-
- ## 🛠️ Quick Start
-
- ### 1. Select a Service
- Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`
-
- ### 2. Adjust Metrics
- - **Latency P99**: Alert threshold >150ms (adaptive)
- - **Error Rate**: Alert threshold >0.05 (5%)
- - **Throughput**: Current requests per second
- - **CPU/Memory**: Utilization (0.0-1.0 scale)
-
- ### 3. Submit & Analyze
- Click **"Submit Telemetry Event"** to see AI agents in action!
-
- ## 📊 Example Test Cases
-
- ### 🚨 Critical Failure
- Component: api-service
- Latency: 800ms
- Error Rate: 0.25
- CPU: 0.95
- Memory: 0.90
-
- text
- *Expected: CRITICAL severity, circuit_breaker + scale_out actions*
-
- ### ⚠️ Performance Issue
- Component: auth-service
- Latency: 350ms
- Error Rate: 0.08
- CPU: 0.75
- Memory: 0.65
-
- text
- *Expected: HIGH severity, traffic_shift action*
-
- ### ✅ Normal Operation
- Component: payment-service
- Latency: 120ms
- Error Rate: 0.02
- CPU: 0.45
- Memory: 0.35
-
- text
- *Expected: NORMAL status, no actions needed*
-
- ## 🔧 Technical Features
-
- ### Multi-Agent Architecture
- - **🕵️ Detective Agent**: Anomaly detection & pattern recognition
- - **🔍 Diagnostician Agent**: Root cause analysis & investigation
- - **🤖 Orchestration Manager**: Coordinates all agents in parallel
-
- ### Smart Detection
- - Adaptive thresholds that learn from your environment
- - Multi-dimensional anomaly scoring (0-100% confidence)
- - Correlation analysis across metrics
- - FAISS vector memory for incident similarity
-
- ### Business Intelligence
- - Real-time revenue impact calculations
- - User impact estimation
- - Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
-
- ## 🎮 Try These Scenarios
-
- ### Test 1: Resource Exhaustion
- Set CPU to 0.95 and Memory to 0.95 - watch scale_out actions trigger
-
- ### Test 2: High Latency + Errors
- Set Latency to 500ms and Error Rate to 0.15 - see circuit breaker activation
-
- ### Test 3: Gradual Degradation
- Start with normal values and slowly increase latency/errors to see adaptive thresholds
-
- ## 🚨 Default Alert Thresholds
-
- | Metric | Warning | Critical |
- |--------|---------|----------|
- | Latency P99 | >150ms | >300ms |
- | Error Rate | >0.05 | >0.15 |
- | CPU Utilization | >0.8 | >0.9 |
- | Memory Utilization | >0.8 | >0.9 |
-
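The default thresholds in the removed table can be read as a per-metric lookup. The following is a minimal sketch under that reading, not the framework's actual code: the metric keys and function names are hypothetical, the bounds are static rather than adaptive, and the result is a per-metric level rather than the README's overall LOW/MEDIUM/HIGH/CRITICAL severity.

```python
from typing import Dict, Tuple

# (warning, critical) bounds copied from the "Default Alert Thresholds" table.
# The metric keys are hypothetical names chosen for this sketch.
DEFAULT_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "latency_p99_ms": (150.0, 300.0),
    "error_rate": (0.05, 0.15),
    "cpu_util": (0.8, 0.9),
    "memory_util": (0.8, 0.9),
}


def classify_metric(name: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a single metric reading."""
    warning, critical = DEFAULT_THRESHOLDS[name]
    # The table uses strict '>' comparisons, so a value equal to a bound
    # stays at the lower level.
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_event(metrics: Dict[str, float]) -> Dict[str, str]:
    """Classify every metric in a telemetry event independently."""
    return {name: classify_metric(name, value) for name, value in metrics.items()}
```

Under this sketch, the "Critical Failure" example yields `critical` for latency, error rate, and CPU, but only `warning` for memory, because 0.90 is not strictly greater than 0.9.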
- ## 🔮 Roadmap
-
- - [ ] Predictive anomaly detection
- - [ ] Multi-cloud coordination
- - [ ] Advanced root cause analysis
- - [ ] Automated runbook execution
- - [ ] Team learning and knowledge transfer
-
- ## 💡 Why This Matters
-
- > "The most reliable system is the one that fixes itself before anyone notices there was a problem."
-
- This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.
-
- ## 🛠️ Technical Stack
-
- - **Backend**: Python, FastAPI, Sentence Transformers
- - **AI/ML**: FAISS, Hugging Face, Custom Agents
- - **Frontend**: Gradio
- - **Storage**: FAISS vector database, JSON metadata
-
- ---
-
- **Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**
-
- *AI Infrastructure Engineer | Building Self-Healing Agentic Systems*
 
+ # 🧠 Agentic Reliability Framework (v2.0 - PATCHED)
+
+ **Multi-Agent AI System for Production Reliability Monitoring**
+
+ [![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/)
+ [![Security: Patched](https://img.shields.io/badge/security-patched-green.svg)](requirements.txt)
+ [![Tests: 40+](https://img.shields.io/badge/tests-40+-success.svg)](tests/)
+ [![Coverage: 80%+](https://img.shields.io/badge/coverage-80%25+-brightgreen.svg)](tests/)
+
+ ## 🔒 Security Fixes Applied
+
+ This version includes critical security patches:
+
+ - ✅ **Gradio 5.50.0+** - Fixes CVE-2025-23042 (CVSS 9.1), CVE-2025-48889, CVE-2025-5320
+ - ✅ **Requests 2.32.5+** - Fixes CVE-2023-32681 (CVSS 6.1), CVE-2024-47081
+ - ✅ **SHA-256 Fingerprints** - Replaced insecure MD5 hashing
+ - ✅ **Input Validation** - Comprehensive validation with type checking
+ - ✅ **Rate Limiting** - 60 requests/minute per user
+
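Two of the items in the security list, SHA-256 fingerprints and a 60 requests/minute per-user limit, can be illustrated with the standard library alone. This is a hedged sketch and not the code from this commit: the fields that go into the fingerprint, the `fingerprint` and `SlidingWindowRateLimiter` names, and the sliding-window strategy are all assumptions, and the limiter as written is not thread-safe.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


def fingerprint(component: str, latency_ms: float, error_rate: float) -> str:
    """Hash selected event fields with SHA-256 (field choice is hypothetical)."""
    payload = f"{component}|{latency_ms:.3f}|{error_rate:.4f}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` calls per user within `window_seconds`."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        # `now` is injectable so the behaviour can be checked without sleeping.
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A caller would check `limiter.allow(user_id)` before processing a telemetry event and reject the request when it returns `False`. A multi-threaded server would additionally need a lock around `allow`.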
 
+ ## ⚡ Performance Improvements
+
+ - 🚀 **70% Faster** - Native async handlers (removed event loop creation)
+ - 🔄 **Non-blocking ML** - ProcessPoolExecutor for CPU-intensive operations
+ - 💾 **Thread-Safe FAISS** - Single-writer pattern prevents data corruption
+ - 🧠 **Memory Stable** - LRU eviction prevents memory leaks
+
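The "LRU eviction" item above refers to bounding an in-memory store so it cannot grow without limit. The README does not say which structure is bounded or how, so the following is only a minimal sketch of the general technique using `collections.OrderedDict`; the `LRUCache` class and its methods are hypothetical names.

```python
from collections import OrderedDict
from typing import Generic, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class LRUCache(Generic[K, V]):
    """Bounded mapping that evicts the least recently used entry when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[K, V]" = OrderedDict()

    def get(self, key: K) -> Optional[V]:
        if key not in self._items:
            return None
        # Reading an entry marks it as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: K, value: V) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        # Evict from the front, which holds the least recently used key.
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

With a capacity of 2, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`, because `a` was touched more recently.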
 
+ ## 🧪 Testing & Quality
+
+ - ✅ **40+ Unit Tests** - Comprehensive test coverage
+ - ✅ **Thread Safety Tests** - Race condition prevention verified
+ - ✅ **Concurrency Tests** - Multi-threaded execution validated
+ - ✅ **Integration Tests** - End-to-end pipeline testing
+
+ ## 📦 Installation
+
+ ### Quick Start
+
+ ```bash
+ # Clone repository
+ git clone <your-repo-url>
+ cd agentic-reliability-framework
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run tests
+ pytest tests/ -v --cov
+
+ # Start application
+ python app.py