---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: "latest"
python_version: "3.11"
pinned: false
app_port: 7860
---
# Adaptive Alert Triage & Incident Response Environment (OpenEnv)
**Version**: 0.1.0
**Framework**: OpenEnv
**Status**: Alpha
## Overview
An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.
### Why RL Over Rule-Based Systems?
| **Challenge** | **Rule-Based Limitation** | **RL Advantage** |
| --------------------------- | ---------------------------------------------------------- | ------------------------------------------------------ |
| **Dynamic Patterns** | Static thresholds fail as alert patterns evolve | Learns from feedback, adapts to changing distributions |
| **Context Awareness** | Cannot capture alert correlations or temporal dependencies | Discovers hidden relationships through experience |
| **Resource Optimization** | Fixed allocation ignores varying system states | Optimizes action selection under real-time constraints |
| **False Positive Handling** | Uniform treatment leads to alert fatigue | Learns nuanced confidence signals and noise patterns |
| **Cascading Failures** | Reactive approach misses early warning signs | Proactive detection through predictive state modeling |
## Environment Specification
### State Space (Partial Observability)
**Visible Features:**
- `alerts`: List of active alerts with:
  - `id`: Unique alert identifier
  - `visible_severity`: Noisy severity score (0.0-1.0)
  - `confidence`: Detection confidence (0.0-1.0)
  - `alert_type`: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
  - `age`: Time steps since alert generation
- `system_load`: Current system resource utilization (0.0-1.0)
- `queue_length`: Number of unprocessed alerts
- `time_remaining`: Steps left in episode
**Hidden Features** (ground truth for reward computation):
- `true_severity`: Actual criticality of each alert
- `correlations`: Alert dependency graph
- `future_failures`: Predicted cascading failure probabilities
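The visible part of the state can be pictured as a small typed structure. The sketch below uses standard-library dataclasses so it runs without extra dependencies; the project itself lists Pydantic models in `models.py`, so the class layout, defaults, and the `AlertType` enum here are illustrative assumptions, not the package's actual definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AlertType(str, Enum):
    """Alert categories named in the state-space description."""
    CPU = "CPU"
    MEMORY = "MEMORY"
    DISK = "DISK"
    NETWORK = "NETWORK"
    APPLICATION = "APPLICATION"
    SECURITY = "SECURITY"


@dataclass(frozen=True)
class Alert:
    # Hypothetical layout: only agent-visible fields, no ground truth.
    id: str
    visible_severity: float  # noisy score in [0.0, 1.0]
    confidence: float        # detection confidence in [0.0, 1.0]
    alert_type: AlertType
    age: int                 # time steps since the alert was generated


@dataclass
class Observation:
    alerts: List[Alert] = field(default_factory=list)
    system_load: float = 0.0
    queue_length: int = 0
    time_remaining: int = 0


obs = Observation(
    alerts=[Alert("A1", 0.82, 0.9, AlertType.CPU, age=2)],
    system_load=0.4,
    queue_length=1,
    time_remaining=19,
)
```

Hidden features such as `true_severity` are deliberately absent from this structure: the agent only ever sees the noisy `visible_severity`.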
### Action Space
Per alert, the agent can execute:
- **INVESTIGATE**: Allocate resources to diagnose (costly but resolves critical issues)
- **IGNORE**: Mark as noise (efficient for false positives)
- **ESCALATE**: Route to specialist team (high-confidence critical alerts)
- **DELAY**: Defer to next time step (queue management)
**Resource Constraints**: Maximum K investigations per time step (task-dependent).
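One way to picture the investigation budget is a filter applied to the actions requested in a single time step. This is a sketch under an assumption: the README does not say what the environment does with over-budget investigations, so downgrading them to `DELAY` is a hypothetical choice, as is the `apply_budget` helper itself.

```python
from enum import Enum
from typing import Dict


class ActionType(str, Enum):
    INVESTIGATE = "INVESTIGATE"
    IGNORE = "IGNORE"
    ESCALATE = "ESCALATE"
    DELAY = "DELAY"


def apply_budget(
    requested: Dict[str, ActionType], max_investigations: int
) -> Dict[str, ActionType]:
    """Keep at most K investigations; defer the rest (assumed behavior)."""
    used = 0
    result: Dict[str, ActionType] = {}
    for alert_id, action in requested.items():
        if action is ActionType.INVESTIGATE:
            if used < max_investigations:
                used += 1
            else:
                # Budget exhausted: push the alert to the next time step.
                action = ActionType.DELAY
        result[alert_id] = action
    return result


plan = apply_budget(
    {
        "A1": ActionType.INVESTIGATE,
        "A2": ActionType.INVESTIGATE,
        "A3": ActionType.IGNORE,
    },
    max_investigations=1,
)
```

With `K = 1`, only the first requested investigation goes through; the second is deferred and the `IGNORE` is left untouched.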
### Reward Structure
```python
+10 # Critical alert correctly investigated
+5 # Cascading failure prevented through correlation detection
+3 # False positive correctly ignored
-2 # Unnecessary investigation (resource waste)
-8 # Missed critical alert
-10 # System failure due to ignored critical issue
```
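The same values can be expressed as a lookup table that sums the outcomes of one step. The numbers come from the table above; the event names and the `step_reward` helper are hypothetical labels chosen for this sketch.

```python
from typing import Iterable

# Reward values from the table above; the keys are illustrative names.
REWARDS = {
    "critical_investigated": 10,
    "cascade_prevented": 5,
    "false_positive_ignored": 3,
    "unnecessary_investigation": -2,
    "missed_critical": -8,
    "system_failure": -10,
}


def step_reward(events: Iterable[str]) -> int:
    """Sum the rewards for every outcome that occurred in one time step."""
    return sum(REWARDS[event] for event in events)


total = step_reward(["critical_investigated", "unnecessary_investigation"])
```

Here one correct investigation (+10) and one wasted investigation (-2) give a step reward of 8.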
### Episode Dynamics
- **Length**: 20-50 time steps (task-dependent)
- **Termination**: Max steps reached OR failure threshold exceeded
- **Alert Generation**: Continuous stochastic process with temporal correlation
- **Failure Mechanics**: Ignored critical alerts accumulate damage, triggering cascading failures
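The two termination conditions can be illustrated with a toy damage model. This is only a sketch: the README does not specify how damage is computed, so summing the true severities of ignored critical alerts, and the `simulate_damage` function and its parameters, are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def simulate_damage(
    ignored_per_step: Sequence[Sequence[float]],
    max_steps: int,
    failure_threshold: float,
) -> Tuple[int, bool]:
    """Return (steps_run, terminated_by_failure).

    ignored_per_step[t] holds the true severities of critical alerts
    that were ignored at step t (hypothetical damage model).
    """
    damage = 0.0
    steps_run = 0
    for ignored in ignored_per_step[:max_steps]:
        steps_run += 1
        damage += sum(ignored)
        if damage >= failure_threshold:
            # Accumulated damage crossed the failure threshold.
            return steps_run, True
    # Either max_steps was reached or the trace ran out.
    return steps_run, False


early_stop = simulate_damage([[0.9], [0.8], [0.9]], max_steps=20, failure_threshold=1.5)
full_run = simulate_damage([[], [0.2], []], max_steps=3, failure_threshold=1.5)
```

In the first call the damage passes the threshold at step 2, so the episode ends early; in the second the episode runs to its step limit.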
## Tasks
### 1. Easy: Basic Alert Prioritization
**Objective**: Correctly classify and handle alerts based on visible signals.
**Success Criteria**: ≥70% correct action rate
**Key Challenge**: Distinguish genuine critical alerts from noise
**Grading**: `correct_actions / total_actions`
### 2. Medium: Resource-Constrained Triage
**Objective**: Optimize triage under strict investigation limits.
**Success Criteria**: ≥65% weighted efficiency score
**Key Challenge**: Maximize critical alert resolution with limited resources
**Grading**: `(weighted_resolved_alerts * resource_efficiency)`
### 3. Hard: Cascading Failures Prevention
**Objective**: Detect correlated alerts and prevent future failures.
**Success Criteria**: ≥60% score with stability requirements
**Key Challenge**: Infer hidden correlations and predict failure chains
**Grading**: `(prevented_failures - system_instability_penalty) / max_possible`
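The three grading formulas can be written out directly. The sketch below follows the expressions and thresholds listed above; the function names are hypothetical, the inputs to the medium score are assumed to be normalized to [0, 1], and the additional "stability requirements" of the hard task are not modeled.

```python
# Pass thresholds taken from the success criteria of each task.
THRESHOLDS = {"easy": 0.70, "medium": 0.65, "hard": 0.60}


def easy_score(correct_actions: int, total_actions: int) -> float:
    """correct_actions / total_actions, or 0.0 for an empty episode."""
    return correct_actions / total_actions if total_actions else 0.0


def medium_score(weighted_resolved_alerts: float, resource_efficiency: float) -> float:
    """Product of the two terms; both assumed to lie in [0, 1]."""
    return weighted_resolved_alerts * resource_efficiency


def hard_score(
    prevented_failures: float,
    system_instability_penalty: float,
    max_possible: float,
) -> float:
    """(prevented_failures - system_instability_penalty) / max_possible."""
    if not max_possible:
        return 0.0
    return (prevented_failures - system_instability_penalty) / max_possible


def passes(task_id: str, score: float) -> bool:
    return score >= THRESHOLDS[task_id]
```

For example, 7 correct actions out of 10 meets the easy threshold exactly, while preventing 6 of 10 possible failures with an instability penalty of 1 scores 0.5 and falls short of the hard threshold.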
## Installation
### Local Setup
```bash
# Clone repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install package in editable mode
pip install -e .
```
### Docker Setup
```bash
# Build Docker image
docker build -t adaptive-alert-triage:latest .
# Run validation
docker run --rm adaptive-alert-triage:latest
# Run evaluation with OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py
```
## Usage
### Quick Start
```python
from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action
# Initialize environment with easy task
env = AdaptiveAlertTriageEnv(task_id="easy")
# Reset environment
observation = env.reset()
# Run episode
done = False
total_reward = 0
while not done:
    # Example: investigate first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE"
    )
    observation, reward, done, info = env.step(action)
    total_reward += reward.value
print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
```
### Running Baseline Agents
```bash
# Rule-based baseline
python agents/baseline.py --task easy
# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium
```
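For orientation, a rule-based threshold policy might look like the sketch below. It is not the code in `agents/baseline.py`: the `choose_action` function and every threshold value are assumptions picked for illustration, using only the two visible signals `visible_severity` and `confidence`.

```python
def choose_action(
    visible_severity: float,
    confidence: float,
    escalate_severity: float = 0.8,
    escalate_confidence: float = 0.9,
    investigate_severity: float = 0.6,
    noise_cutoff: float = 0.3,
) -> str:
    """Map one alert's visible signals to an action (hypothetical thresholds)."""
    if visible_severity >= escalate_severity and confidence >= escalate_confidence:
        return "ESCALATE"      # high-confidence critical alert
    if visible_severity >= investigate_severity:
        return "INVESTIGATE"   # likely critical, worth spending budget
    if visible_severity < noise_cutoff or confidence < noise_cutoff:
        return "IGNORE"        # probably noise
    return "DELAY"             # ambiguous: revisit next step


decisions = [
    choose_action(0.9, 0.95),
    choose_action(0.7, 0.5),
    choose_action(0.1, 0.8),
    choose_action(0.4, 0.6),
]
```

Fixed thresholds like these are what the comparison table at the top of this README argues against: they cannot adapt when the alert distribution shifts.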
### Evaluation
```bash
# Run all baselines on all tasks
python evaluation/evaluate.py
# Generate comparison plots
python evaluation/plots.py
```
## Testing
```bash
# Run all tests
pytest tests/
# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/
# Run specific test file
pytest tests/test_env.py -v
```
## Docker + RL Server
The environment includes a production-ready FastAPI server for remote RL training.
### Architecture
```
External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
                                                        │
                                                        │ Internal: AdaptiveAlertTriageEnv
                                                        │ (real + synthetic alerts)
                                                        │
External RL Trainer (SB3) ──/env/reset────────────────> │ <──/env/step(action)── Obs/Reward/Done
                                                        │
                                                        ▼
                                        RL beats baselines! (0.61 → 0.82+)
```
### Quick Start
```bash
# 1. Build and run the persistent RL server
docker compose up --build -d
# 2. Verify server health
curl http://localhost:8000/health
# 3. Send real alerts (simulate Datadog webhook)
bash scripts/demo_webhook.sh
# 4. Train external RL agent
pip install stable-baselines3
python train_external.py
# 5. View metrics
curl http://localhost:8000/metrics
```
### API Endpoints
| Endpoint | Method | Description |
| ---------------------- | ------ | --------------------------------------- |
| `/health` | GET | Health check (env_ready, queue_size) |
| `/metrics` | GET | RL score vs baseline comparison |
| `/ingest/alerts` | POST | Webhook receiver for Datadog/Kafka |
| `/env/reset/{task_id}` | POST | Initialize episode (easy/medium/hard) |
| `/env/step` | POST | Take RL action, receive obs/reward/done |
| `/env/state` | GET | Debug: current episode state |
| `/tasks` | GET | List available tasks |
| `/ws/train` | WS | Real-time streaming RL loop |
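A client for the HTTP endpoints can be sketched with the standard library alone. The paths and methods come from the table above; the JSON body for `/env/step` is an assumption that mirrors the action payload used in the WebSocket example, and the `reset_request` and `step_request` helpers are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # port used by the curl examples above


def reset_request(task_id: str) -> Request:
    """Build POST /env/reset/{task_id} (task_id: easy, medium, or hard)."""
    return Request(f"{BASE_URL}/env/reset/{task_id}", data=b"", method="POST")


def step_request(alert_id: str, action_type: str) -> Request:
    """Build POST /env/step with an assumed JSON action payload."""
    body = json.dumps({"alert_id": alert_id, "action_type": action_type})
    return Request(
        f"{BASE_URL}/env/step",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset = reset_request("easy")
step = step_request("A1", "INVESTIGATE")

# To actually send a request against a running server:
#   from urllib.request import urlopen
#   with urlopen(step) as response:
#       result = json.load(response)
```

The requests are only constructed here, so the snippet runs without a server; check the server's own schema before relying on the payload shape.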
### WebSocket Training
```python
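import asyncio
import json

import websockets


async def train() -> None:
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset the environment for the hard task
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = json.loads(await ws.recv())
        # Step loop ("A1" is a placeholder; use an alert id from the observation)
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"}
            }))
            result = json.loads(await ws.recv())
            if result["done"]:
                break


# `async with` is only valid inside a coroutine, so run it via asyncio
asyncio.run(train())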
```
---
## Project Structure
```
adaptive_alert_triage_openenv/
├── README.md                      # This file
├── pyproject.toml                 # Project metadata and dependencies
├── openenv.yaml                   # OpenEnv specification
├── Dockerfile                     # Container build instructions
├── requirements.txt               # Python dependencies
│
├── src/adaptive_alert_triage/     # Core environment implementation
│   ├── __init__.py
│   ├── env.py                     # Main Gym environment
│   ├── models.py                  # Pydantic Observation/Action/Reward models
│   └── utils.py                   # Helper functions
│
├── tasks/                         # Task definitions and graders
│   ├── easy.py                    # Basic prioritization
│   ├── medium.py                  # Resource-constrained triage
│   └── hard.py                    # Cascading failure prevention
│
├── rewards/                       # Reward shaping logic
│   └── reward.py
│
├── agents/                        # Baseline and example agents
│   ├── baseline.py                # Rule-based threshold agent
│   └── inference.py               # OpenAI API baseline
│
├── tests/                         # Unit and integration tests
│   ├── test_env.py
│   ├── test_tasks.py
│   └── test_rewards.py
│
├── evaluation/                    # Performance analysis
│   ├── evaluate.py                # Run benchmarks
│   └── plots.py                   # Generate comparison charts
│
└── docker/                        # Docker utilities
    └── entrypoint.sh              # Container startup script
```
## OpenEnv Compliance
This environment adheres to the OpenEnv specification:
- ✅ Pydantic models for Observation, Action, and Reward
- ✅ OpenEnv-compatible API (`reset()`, `step()`, `state()`)
- ✅ Task-based evaluation with graders
- ✅ Reproducible seeding
- ✅ Docker containerization
- ✅ `openenv.yaml` metadata
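Reproducible seeding usually means that every random draw flows from one explicit seed. The sketch below shows the idea with a local `random.Random` instance; the `generate_severities` function is hypothetical and says nothing about how the environment's own generator is implemented.

```python
import random
from typing import List


def generate_severities(seed: int, n: int) -> List[float]:
    """Draw n severity values in [0.0, 1.0) from a generator owned by this call."""
    rng = random.Random(seed)  # local instance: no shared global state
    return [rng.random() for _ in range(n)]


run_a = generate_severities(seed=42, n=5)
run_b = generate_severities(seed=42, n=5)
run_c = generate_severities(seed=43, n=5)
```

Two runs with the same seed yield identical sequences, which is what makes episode-level comparisons between agents fair.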
## Contributing
Contributions are welcome! Please follow these conventions:
1. Black code formatting (`black .`)
2. Type hints for all functions
3. Docstrings in Google style
4. Unit tests for new features
## License
MIT License - see LICENSE file for details.