---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: "latest"
python_version: "3.11"
pinned: false
app_port: 7860
---

# Adaptive Alert Triage & Incident Response Environment (OpenEnv)

**Version**: 0.1.0  
**Framework**: OpenEnv  
**Status**: Alpha

## Overview

An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.

### Why RL Over Rule-Based Systems?

| **Challenge**               | **Rule-Based Limitation**                                  | **RL Advantage**                                       |
| --------------------------- | ---------------------------------------------------------- | ------------------------------------------------------ |
| **Dynamic Patterns**        | Static thresholds fail as alert patterns evolve            | Learns from feedback, adapts to changing distributions |
| **Context Awareness**       | Cannot capture alert correlations or temporal dependencies | Discovers hidden relationships through experience      |
| **Resource Optimization**   | Fixed allocation ignores varying system states             | Optimizes action selection under real-time constraints |
| **False Positive Handling** | Uniform treatment leads to alert fatigue                   | Learns nuanced confidence signals and noise patterns   |
| **Cascading Failures**      | Reactive approach misses early warning signs               | Proactive detection through predictive state modeling  |

## Environment Specification

### State Space (Partial Observability)

**Visible Features:**

- `alerts`: List of active alerts with:
  - `id`: Unique alert identifier
  - `visible_severity`: Noisy severity score (0.0-1.0)
  - `confidence`: Detection confidence (0.0-1.0)
  - `alert_type`: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
  - `age`: Time steps since alert generation
- `system_load`: Current system resource utilization (0.0-1.0)
- `queue_length`: Number of unprocessed alerts
- `time_remaining`: Steps left in episode

**Hidden Features** (ground truth for reward computation):

- `true_severity`: Actual criticality of each alert
- `correlations`: Alert dependency graph
- `future_failures`: Predicted cascading failure probabilities
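
The visible part of the state can be pictured as a small typed structure. The sketch below is illustrative only: it uses standard-library dataclasses rather than the project's Pydantic models, and the class names `AlertType`, `VisibleAlert`, and `Observation` as written here are assumptions, not the contents of `models.py`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AlertType(str, Enum):
    CPU = "CPU"
    MEMORY = "MEMORY"
    DISK = "DISK"
    NETWORK = "NETWORK"
    APPLICATION = "APPLICATION"
    SECURITY = "SECURITY"


@dataclass(frozen=True)
class VisibleAlert:
    """One alert as the agent sees it (hypothetical layout)."""
    id: str
    visible_severity: float  # noisy severity score in [0.0, 1.0]
    confidence: float        # detection confidence in [0.0, 1.0]
    alert_type: AlertType
    age: int                 # time steps since the alert was generated

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        for name in ("visible_severity", "confidence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
        if self.age < 0:
            raise ValueError("age must be non-negative")


@dataclass
class Observation:
    """Visible features only; hidden features never appear here."""
    alerts: List[VisibleAlert] = field(default_factory=list)
    system_load: float = 0.0
    queue_length: int = 0
    time_remaining: int = 0


obs = Observation(
    alerts=[VisibleAlert("A1", 0.8, 0.9, AlertType.CPU, age=2)],
    system_load=0.4,
    queue_length=1,
    time_remaining=19,
)
```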

### Action Space

Per alert, the agent can execute:

- **INVESTIGATE**: Allocate resources to diagnose (costly but resolves critical issues)
- **IGNORE**: Mark as noise (efficient for false positives)
- **ESCALATE**: Route to specialist team (high-confidence critical alerts)
- **DELAY**: Defer to next time step (queue management)

**Resource Constraints**: Maximum K investigations per time step (task-dependent).
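
One way to picture the investigation limit is as a filter over the actions chosen in a single time step. The following is a minimal sketch, not the environment's actual logic: the function name `enforce_investigation_budget` is hypothetical, and downgrading over-budget investigations to `DELAY` is an assumption (the environment may reject or penalize them instead).

```python
from typing import List, Tuple

ACTION_TYPES = ("INVESTIGATE", "IGNORE", "ESCALATE", "DELAY")


def enforce_investigation_budget(
    actions: List[Tuple[str, str]], max_investigations: int
) -> List[Tuple[str, str]]:
    """Cap the number of INVESTIGATE actions in one time step.

    `actions` holds (alert_id, action_type) pairs in priority order.
    Investigations beyond the budget K are turned into DELAY (assumed behavior).
    """
    used = 0
    result = []
    for alert_id, action_type in actions:
        if action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action type: {action_type}")
        if action_type == "INVESTIGATE":
            if used >= max_investigations:
                action_type = "DELAY"
            else:
                used += 1
        result.append((alert_id, action_type))
    return result
```

With `K = 1`, the second of two requested investigations is deferred while other action types pass through unchanged.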

### Reward Structure

```python
+10  # Critical alert correctly investigated
+5   # Cascading failure prevented through correlation detection
+3   # False positive correctly ignored
-2   # Unnecessary investigation (resource waste)
-8   # Missed critical alert
-10  # System failure due to ignored critical issue
```
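
The values above can be read as a lookup keyed on the action and the hidden ground truth. The sketch below is one possible reading, not the code in `rewards/reward.py`: the function name `step_reward` is hypothetical, and it assumes that the cascade bonus and the system-failure penalty stack on top of the per-action reward and that `ESCALATE` and `DELAY` earn nothing by themselves.

```python
def step_reward(
    action_type: str,
    is_critical: bool,
    prevented_cascade: bool = False,
    caused_failure: bool = False,
) -> int:
    """Combine the documented reward values for a single action (assumed rules)."""
    reward = 0
    if action_type == "INVESTIGATE":
        reward += 10 if is_critical else -2  # correct vs. unnecessary investigation
    elif action_type == "IGNORE":
        reward += -8 if is_critical else 3   # missed critical vs. correct ignore
    # ESCALATE and DELAY: no direct reward is documented, so 0 is assumed.
    if prevented_cascade:
        reward += 5    # cascading failure prevented
    if caused_failure:
        reward += -10  # system failure due to an ignored critical issue
    return reward
```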

### Episode Dynamics

- **Length**: 20-50 time steps (task-dependent)
- **Termination**: Max steps reached OR failure threshold exceeded
- **Alert Generation**: Continuous stochastic process with temporal correlation
- **Failure Mechanics**: Ignored critical alerts accumulate damage, triggering cascading failures
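
The termination rule and damage accumulation can be sketched as a small state object. This is an illustration under stated assumptions: the names `EpisodeState` and `CRITICAL_SEVERITY`, the 0.7 cutoff, and the choice to add an ignored alert's true severity to the damage total are all hypothetical and not taken from the environment's source.

```python
from dataclasses import dataclass
from typing import Iterable

CRITICAL_SEVERITY = 0.7  # hypothetical cutoff for "critical"


@dataclass
class EpisodeState:
    max_steps: int            # task-dependent, 20-50 per the list above
    failure_threshold: float  # damage level that ends the episode early
    step: int = 0
    damage: float = 0.0

    @property
    def done(self) -> bool:
        # Max steps reached OR failure threshold exceeded.
        return self.step >= self.max_steps or self.damage > self.failure_threshold

    def advance(self, ignored_true_severities: Iterable[float]) -> bool:
        """Advance one time step and accumulate damage from ignored critical alerts."""
        self.step += 1
        self.damage += sum(
            s for s in ignored_true_severities if s >= CRITICAL_SEVERITY
        )
        return self.done
```

For example, with `failure_threshold=1.5`, ignoring a 0.9 alert and then a 0.8 alert ends the episode on the second step, well before `max_steps`.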

## Tasks

### 1. Easy: Basic Alert Prioritization

**Objective**: Correctly classify and handle alerts based on visible signals.  
**Success Criteria**: ≥70% correct action rate  
**Key Challenge**: Distinguish genuine critical alerts from noise  
**Grading**: `correct_actions / total_actions`

### 2. Medium: Resource-Constrained Triage

**Objective**: Optimize triage under strict investigation limits.  
**Success Criteria**: ≥65% weighted efficiency score  
**Key Challenge**: Maximize critical alert resolution with limited resources  
**Grading**: `(weighted_resolved_alerts * resource_efficiency)`

### 3. Hard: Cascading Failure Prevention

**Objective**: Detect correlated alerts and prevent future failures.  
**Success Criteria**: ≥60% score with stability requirements  
**Key Challenge**: Infer hidden correlations and predict failure chains  
**Grading**: `(prevented_failures - system_instability_penalty) / max_possible`
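
The three grading formulas can be written out directly. The sketch below mirrors the expressions given for each task; the function names, the zero-division guards, and the assumption that the medium-task inputs are already normalized to [0, 1] are mine, and the hard task's additional stability requirements are not modeled.

```python
PASS_THRESHOLDS = {"easy": 0.70, "medium": 0.65, "hard": 0.60}


def grade_easy(correct_actions: int, total_actions: int) -> float:
    """correct_actions / total_actions, with 0.0 for an empty episode (assumed)."""
    return correct_actions / total_actions if total_actions else 0.0


def grade_medium(weighted_resolved_alerts: float, resource_efficiency: float) -> float:
    """weighted_resolved_alerts * resource_efficiency (inputs assumed in [0, 1])."""
    return weighted_resolved_alerts * resource_efficiency


def grade_hard(
    prevented_failures: float,
    system_instability_penalty: float,
    max_possible: float,
) -> float:
    """(prevented_failures - system_instability_penalty) / max_possible."""
    if not max_possible:
        return 0.0
    return (prevented_failures - system_instability_penalty) / max_possible


def passed(task_id: str, score: float) -> bool:
    """Compare a score against the documented success threshold."""
    return score >= PASS_THRESHOLDS[task_id]
```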

## Installation

### Local Setup

```bash
# Clone repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in editable mode
pip install -e .
```

### Docker Setup

```bash
# Build Docker image
docker build -t adaptive-alert-triage:latest .

# Run validation
docker run --rm adaptive-alert-triage:latest

# Run evaluation with OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py
```

## Usage

### Quick Start

```python
from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action

# Initialize environment with easy task
env = AdaptiveAlertTriageEnv(task_id="easy")

# Reset environment
observation = env.reset()

# Run episode
done = False
total_reward = 0

while not done:
    # Example: investigate first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE"
    )

    observation, reward, done, info = env.step(action)
    total_reward += reward.value

print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
```

### Running Baseline Agents

```bash
# Rule-based baseline
python agents/baseline.py --task easy

# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium
```
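
For orientation, a rule-based threshold policy might look like the following. This is a sketch of the general idea only, not the contents of `agents/baseline.py`: the function name `threshold_policy` and all three cutoffs are hypothetical.

```python
def threshold_policy(
    visible_severity: float,
    confidence: float,
    *,
    high_severity: float = 0.7,    # hypothetical cutoff
    low_severity: float = 0.3,     # hypothetical cutoff
    high_confidence: float = 0.8,  # hypothetical cutoff
) -> str:
    """Map one alert's visible signals to an action type using fixed thresholds."""
    if visible_severity >= high_severity:
        # Severe alerts: escalate when the detector is confident, otherwise diagnose.
        return "ESCALATE" if confidence >= high_confidence else "INVESTIGATE"
    if visible_severity < low_severity:
        # Low visible severity is treated as noise.
        return "IGNORE"
    # Ambiguous middle band: defer to the next time step.
    return "DELAY"
```

Because the cutoffs are fixed and act on the noisy `visible_severity`, a policy like this cannot account for hidden correlations between alerts.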

### Evaluation

```bash
# Run all baselines on all tasks
python evaluation/evaluate.py

# Generate comparison plots
python evaluation/plots.py
```

## Testing

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/

# Run specific test file
pytest tests/test_env.py -v
```

## Docker + RL Server

The environment includes a FastAPI server for remote RL training.

### Architecture

```
External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
                                                        │
                                                        │ Internal: AdaptiveAlertTriageEnv
                                                        │ (real + synthetic alerts)
                                                        ↓
External RL Trainer (SB3)      ──/env/reset───────────> │ <──/env/step(action)── Obs/Reward/Done
                                                        │
                                                        ↓
                                                  RL beats baselines! (0.61 → 0.82+)
```

### Quick Start

```bash
# 1. Build and run the persistent RL server
docker compose up --build -d

# 2. Verify server health
curl http://localhost:8000/health

# 3. Send real alerts (simulate Datadog webhook)
bash scripts/demo_webhook.sh

# 4. Train external RL agent
pip install stable-baselines3
python train_external.py

# 5. View metrics
curl http://localhost:8000/metrics
```

### API Endpoints

| Endpoint               | Method | Description                             |
| ---------------------- | ------ | --------------------------------------- |
| `/health`              | GET    | Health check (env_ready, queue_size)    |
| `/metrics`             | GET    | RL score vs baseline comparison         |
| `/ingest/alerts`       | POST   | Webhook receiver for Datadog/Kafka      |
| `/env/reset/{task_id}` | POST   | Initialize episode (easy/medium/hard)   |
| `/env/step`            | POST   | Take RL action, receive obs/reward/done |
| `/env/state`           | GET    | Debug: current episode state            |
| `/tasks`               | GET    | List available tasks                    |
| `/ws/train`            | WS     | Real-time streaming RL loop             |
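
A client for the reset and step endpoints could be sketched with the standard library alone. The helper names below are hypothetical, and the JSON body for `/env/step` (a bare object with `alert_id` and `action_type`) is an assumption; check the server's request schema before relying on it. The base URL uses port 8000, as in the Quick Start above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_reset_request(task_id: str) -> urllib.request.Request:
    """POST /env/reset/{task_id} with an empty body."""
    return urllib.request.Request(
        f"{BASE_URL}/env/reset/{task_id}", data=b"", method="POST"
    )


def build_step_request(alert_id: str, action_type: str) -> urllib.request.Request:
    """POST /env/step with a JSON action (assumed body shape)."""
    body = json.dumps({"alert_id": alert_id, "action_type": action_type})
    return urllib.request.Request(
        f"{BASE_URL}/env/step",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request to a running server and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_reset_request("easy"))` followed by repeated `send(build_step_request(...))` calls would drive one episode over HTTP.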

### WebSocket Training

```python
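import asyncio
import json

import websockets


async def train() -> None:
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset the environment for the hard task
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = json.loads(await ws.recv())

        # Step loop: repeat a fixed example action until the episode ends
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"}
            }))
            result = json.loads(await ws.recv())
            if result["done"]:
                break


# `async with` is only valid inside a coroutine, so run it via asyncio
asyncio.run(train())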
```

---

## Project Structure

```
adaptive_alert_triage_openenv/
├── README.md                   # This file
├── pyproject.toml              # Project metadata and dependencies
├── openenv.yaml                # OpenEnv specification
├── Dockerfile                  # Container build instructions
├── requirements.txt            # Python dependencies
│
├── src/adaptive_alert_triage/  # Core environment implementation
│   ├── __init__.py
│   ├── env.py                  # Main Gym environment
│   ├── models.py               # Pydantic Observation/Action/Reward models
│   └── utils.py                # Helper functions
│
├── tasks/                      # Task definitions and graders
│   ├── easy.py                 # Basic prioritization
│   ├── medium.py               # Resource-constrained triage
│   └── hard.py                 # Cascading failure prevention
│
├── rewards/                    # Reward shaping logic
│   └── reward.py
│
├── agents/                     # Baseline and example agents
│   ├── baseline.py             # Rule-based threshold agent
│   └── inference.py            # OpenAI API baseline
│
├── tests/                      # Unit and integration tests
│   ├── test_env.py
│   ├── test_tasks.py
│   └── test_rewards.py
│
├── evaluation/                 # Performance analysis
│   ├── evaluate.py             # Run benchmarks
│   └── plots.py                # Generate comparison charts
│
└── docker/                     # Docker utilities
    └── entrypoint.sh           # Container startup script
```

## OpenEnv Compliance

This environment adheres to the OpenEnv specification:

- ✅ Pydantic models for Observation, Action, and Reward
- ✅ OpenEnv-compatible API (`reset()`, `step()`, `state()`)
- ✅ Task-based evaluation with graders
- ✅ Reproducible seeding
- ✅ Docker containerization
- ✅ `openenv.yaml` metadata

## Contributing

Contributions are welcome! Please follow these conventions:

1. Black code formatting (`black .`)
2. Type hints for all functions
3. Docstrings in Google style
4. Unit tests for new features

## License

MIT License - see LICENSE file for details.