---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: "latest"
python_version: "3.11"
pinned: false
app_port: 7860
---
# Adaptive Alert Triage & Incident Response Environment (OpenEnv)
**Version**: 0.1.0
**Framework**: OpenEnv
**Status**: Alpha
## Overview
An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.
### Why RL Over Rule-Based Systems?
| **Challenge** | **Rule-Based Limitation** | **RL Advantage** |
| --------------------------- | ---------------------------------------------------------- | ------------------------------------------------------ |
| **Dynamic Patterns** | Static thresholds fail as alert patterns evolve | Learns from feedback, adapts to changing distributions |
| **Context Awareness** | Cannot capture alert correlations or temporal dependencies | Discovers hidden relationships through experience |
| **Resource Optimization** | Fixed allocation ignores varying system states | Optimizes action selection under real-time constraints |
| **False Positive Handling** | Uniform treatment leads to alert fatigue | Learns nuanced confidence signals and noise patterns |
| **Cascading Failures** | Reactive approach misses early warning signs | Proactive detection through predictive state modeling |
## Environment Specification
### State Space (Partial Observability)
**Visible Features:**
- `alerts`: List of active alerts with:
- `id`: Unique alert identifier
- `visible_severity`: Noisy severity score (0.0-1.0)
- `confidence`: Detection confidence (0.0-1.0)
- `alert_type`: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
- `age`: Time steps since alert generation
- `system_load`: Current system resource utilization (0.0-1.0)
- `queue_length`: Number of unprocessed alerts
- `time_remaining`: Steps left in episode
**Hidden Features** (ground truth for reward computation):
- `true_severity`: Actual criticality of each alert
- `correlations`: Alert dependency graph
- `future_failures`: Predicted cascading failure probabilities
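The observation above can be sketched as plain dataclasses. This is a simplified illustration with assumed field names following the list above; the actual environment defines these as Pydantic models in `models.py`:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Alert:
    id: str
    visible_severity: float   # noisy severity in [0.0, 1.0]
    confidence: float         # detection confidence in [0.0, 1.0]
    alert_type: str           # CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY
    age: int                  # time steps since generation

@dataclass
class Observation:
    alerts: List[Alert] = field(default_factory=list)
    system_load: float = 0.0  # resource utilization in [0.0, 1.0]
    queue_length: int = 0     # unprocessed alerts
    time_remaining: int = 0   # steps left in episode

# Example observation with one high-confidence CPU alert
obs = Observation(
    alerts=[Alert(id="A1", visible_severity=0.9, confidence=0.85,
                  alert_type="CPU", age=2)],
    system_load=0.7, queue_length=1, time_remaining=30,
)
```

Note that the hidden features (`true_severity`, `correlations`, `future_failures`) never appear in the observation; they exist only server-side for reward computation.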
### Action Space
For each active alert, the agent chooses one of four actions:
- **INVESTIGATE**: Allocate resources to diagnose (costly but resolves critical issues)
- **IGNORE**: Mark as noise (efficient for false positives)
- **ESCALATE**: Route to specialist team (high-confidence critical alerts)
- **DELAY**: Defer to next time step (queue management)
**Resource Constraints**: Maximum K investigations per time step (task-dependent).
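One way the per-step budget could be enforced is to cap INVESTIGATE actions at K and defer the overflow. This is a hypothetical helper for illustration, not part of the package API:

```python
def filter_investigations(actions, k):
    """Keep at most k INVESTIGATE actions per step; convert overflow to DELAY.

    `actions` is a list of (alert_id, action_type) tuples -- an illustrative
    encoding, not the package's actual Action model.
    """
    budget = k
    filtered = []
    for alert_id, action_type in actions:
        if action_type == "INVESTIGATE":
            if budget > 0:
                budget -= 1
            else:
                action_type = "DELAY"  # over budget: defer instead
        filtered.append((alert_id, action_type))
    return filtered

plan = [("A1", "INVESTIGATE"), ("A2", "INVESTIGATE"), ("A3", "INVESTIGATE")]
print(filter_investigations(plan, k=2))
# with k=2, the third investigation is deferred to DELAY
```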
### Reward Structure
```python
+10 # Critical alert correctly investigated
+5 # Cascading failure prevented through correlation detection
+3 # False positive correctly ignored
-2 # Unnecessary investigation (resource waste)
-8 # Missed critical alert
-10 # System failure due to ignored critical issue
```
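The table above maps outcome events to scalar rewards. A per-step reward could be accumulated as a simple lookup-and-sum; the event names here are illustrative, and the actual logic lives in `rewards/reward.py`:

```python
# Outcome-to-reward mapping, mirroring the table above
REWARDS = {
    "critical_investigated": +10,
    "cascade_prevented": +5,
    "false_positive_ignored": +3,
    "unnecessary_investigation": -2,
    "missed_critical": -8,
    "system_failure": -10,
}

def step_reward(events):
    """Sum the rewards for all outcome events observed this step."""
    return sum(REWARDS[e] for e in events)

print(step_reward(["critical_investigated", "unnecessary_investigation"]))  # 8
```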
### Episode Dynamics
- **Length**: 20-50 time steps (task-dependent)
- **Termination**: Max steps reached OR failure threshold exceeded
- **Alert Generation**: Continuous stochastic process with temporal correlation
- **Failure Mechanics**: Ignored critical alerts accumulate damage, triggering cascading failures
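The failure mechanics above can be sketched as a damage accumulator: each ignored critical alert adds damage, and the episode terminates once a threshold is crossed. The increment and threshold values here are illustrative assumptions, not the environment's actual parameters:

```python
def update_damage(damage, ignored_critical_alerts, increment=1, threshold=10):
    """Accumulate damage for ignored critical alerts.

    Returns (new_damage, failed); the episode ends when accumulated
    damage reaches the failure threshold.
    """
    new_damage = damage + increment * ignored_critical_alerts
    return new_damage, new_damage >= threshold

# Ignoring one critical alert per step triggers failure at step 10
damage, failed = 0, False
for step in range(12):
    damage, failed = update_damage(damage, ignored_critical_alerts=1)
    if failed:
        break
print(damage, failed)  # 10 True
```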
## Tasks
### 1. Easy: Basic Alert Prioritization
**Objective**: Correctly classify and handle alerts based on visible signals.
**Success Criteria**: β‰₯70% correct action rate
**Key Challenge**: Distinguish genuine critical alerts from noise
**Grading**: `correct_actions / total_actions`
### 2. Medium: Resource-Constrained Triage
**Objective**: Optimize triage under strict investigation limits.
**Success Criteria**: β‰₯65% weighted efficiency score
**Key Challenge**: Maximize critical alert resolution with limited resources
**Grading**: `(weighted_resolved_alerts * resource_efficiency)`
### 3. Hard: Cascading Failure Prevention
**Objective**: Detect correlated alerts and prevent future failures.
**Success Criteria**: β‰₯60% score with stability requirements
**Key Challenge**: Infer hidden correlations and predict failure chains
**Grading**: `(prevented_failures - system_instability_penalty) / max_possible`
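The three grading formulas above reduce to simple ratios and products. A sketch under assumed variable names (the actual graders live in `tasks/`):

```python
def grade_easy(correct_actions, total_actions):
    """Fraction of actions that were correct."""
    return correct_actions / total_actions

def grade_medium(weighted_resolved, resource_efficiency):
    """Weighted resolution score scaled by resource efficiency."""
    return weighted_resolved * resource_efficiency

def grade_hard(prevented_failures, instability_penalty, max_possible):
    """Prevented failures net of instability, normalized."""
    return (prevented_failures - instability_penalty) / max_possible

print(grade_easy(14, 20))    # 0.7 -> meets the 70% bar for the easy task
print(grade_hard(7, 1, 10))  # 0.6 -> meets the 60% bar for the hard task
```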
## Installation
### Local Setup
```bash
# Clone repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install package in editable mode
pip install -e .
```
### Docker Setup
```bash
# Build Docker image
docker build -t adaptive-alert-triage:latest .
# Run validation
docker run --rm adaptive-alert-triage:latest
# Run evaluation with OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py
```
## Usage
### Quick Start
```python
from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action

# Initialize environment with easy task
env = AdaptiveAlertTriageEnv(task_id="easy")

# Reset environment
observation = env.reset()

# Run episode
done = False
total_reward = 0
while not done:
    # Example: investigate first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE"
    )
    observation, reward, done, info = env.step(action)
    total_reward += reward.value

print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
```
### Running Baseline Agents
```bash
# Rule-based baseline
python agents/baseline.py --task easy
# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium
```
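The rule-based baseline in `agents/baseline.py` applies fixed thresholds to the visible signals. A simplified sketch of such a policy, with illustrative threshold values that are not necessarily the baseline's actual settings:

```python
def rule_based_action(visible_severity, confidence,
                      investigate_thr=0.7, ignore_thr=0.3):
    """Threshold policy: escalate confident criticals, investigate
    likely criticals, ignore probable noise, otherwise delay."""
    if visible_severity >= investigate_thr and confidence >= 0.8:
        return "ESCALATE"
    if visible_severity >= investigate_thr:
        return "INVESTIGATE"
    if visible_severity <= ignore_thr and confidence >= 0.5:
        return "IGNORE"
    return "DELAY"

print(rule_based_action(0.9, 0.85))  # ESCALATE
print(rule_based_action(0.2, 0.6))   # IGNORE
print(rule_based_action(0.5, 0.5))   # DELAY (ambiguous signal)
```

Static thresholds like these are exactly what the RL agent is meant to outperform: they cannot adapt when the severity-noise distribution shifts.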
### Evaluation
```bash
# Run all baselines on all tasks
python evaluation/evaluate.py
# Generate comparison plots
python evaluation/plots.py
```
## Testing
```bash
# Run all tests
pytest tests/
# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/
# Run specific test file
pytest tests/test_env.py -v
```
## Docker + RL Server
The environment includes a production-ready FastAPI server for remote RL training.
### Architecture
```
External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
β”‚
β”‚ Internal: AdaptiveAlertTriageEnv
β”‚ (real + synthetic alerts)
↓
External RL Trainer (SB3) ──/env/reset───────────> β”‚ <──/env/step(action)── Obs/Reward/Done
β”‚
↓
RL beats baselines! (0.61 β†’ 0.82+)
```
### Quick Start
```bash
# 1. Build and run the persistent RL server
docker compose up --build -d
# 2. Verify server health
curl http://localhost:8000/health
# 3. Send real alerts (simulate Datadog webhook)
bash scripts/demo_webhook.sh
# 4. Train external RL agent
pip install stable-baselines3
python train_external.py
# 5. View metrics
curl http://localhost:8000/metrics
```
### API Endpoints
| Endpoint | Method | Description |
| ---------------------- | ------ | --------------------------------------- |
| `/health` | GET | Health check (env_ready, queue_size) |
| `/metrics` | GET | RL score vs baseline comparison |
| `/ingest/alerts` | POST | Webhook receiver for Datadog/Kafka |
| `/env/reset/{task_id}` | POST | Initialize episode (easy/medium/hard) |
| `/env/step` | POST | Take RL action, receive obs/reward/done |
| `/env/state` | GET | Debug: current episode state |
| `/tasks` | GET | List available tasks |
| `/ws/train` | WS | Real-time streaming RL loop |
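The `/env/reset/{task_id}` and `/env/step` endpoints can be driven over plain HTTP. The JSON shapes below are assumptions inferred from the Quick Start models, not a verified API contract:

```python
import json

# POST /env/reset/{task_id} -- task id goes in the path, empty body
reset_url = "http://localhost:8000/env/reset/easy"

# POST /env/step -- the action is encoded as JSON (assumed shape)
step_payload = {
    "action": {"alert_id": "A1", "action_type": "INVESTIGATE"}
}
body = json.dumps(step_payload)
print(body)

# e.g. with the requests library, against a running server:
#   requests.post(reset_url)
#   requests.post("http://localhost:8000/env/step", json=step_payload)
```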
### WebSocket Training
```python
import asyncio
import json

import websockets

async def train():
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = await ws.recv()

        # Step loop
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"}
            }))
            result = await ws.recv()
            if json.loads(result)["done"]:
                break

asyncio.run(train())
```
---
## Project Structure
```
adaptive_alert_triage_openenv/
β”œβ”€β”€ README.md # This file
β”œβ”€β”€ pyproject.toml # Project metadata and dependencies
β”œβ”€β”€ openenv.yaml # OpenEnv specification
β”œβ”€β”€ Dockerfile # Container build instructions
β”œβ”€β”€ requirements.txt # Python dependencies
β”‚
β”œβ”€β”€ src/adaptive_alert_triage/ # Core environment implementation
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ env.py # Main Gym environment
β”‚ β”œβ”€β”€ models.py # Pydantic Observation/Action/Reward models
β”‚ └── utils.py # Helper functions
β”‚
β”œβ”€β”€ tasks/ # Task definitions and graders
β”‚ β”œβ”€β”€ easy.py # Basic prioritization
β”‚ β”œβ”€β”€ medium.py # Resource-constrained triage
β”‚ └── hard.py # Cascading failure prevention
β”‚
β”œβ”€β”€ rewards/ # Reward shaping logic
β”‚ └── reward.py
β”‚
β”œβ”€β”€ agents/ # Baseline and example agents
β”‚ β”œβ”€β”€ baseline.py # Rule-based threshold agent
β”‚ └── inference.py # OpenAI API baseline
β”‚
β”œβ”€β”€ tests/ # Unit and integration tests
β”‚ β”œβ”€β”€ test_env.py
β”‚ β”œβ”€β”€ test_tasks.py
β”‚ └── test_rewards.py
β”‚
β”œβ”€β”€ evaluation/ # Performance analysis
β”‚ β”œβ”€β”€ evaluate.py # Run benchmarks
β”‚ └── plots.py # Generate comparison charts
β”‚
└── docker/ # Docker utilities
└── entrypoint.sh # Container startup script
```
## OpenEnv Compliance
This environment adheres to the OpenEnv specification:
- βœ… Pydantic models for Observation, Action, and Reward
- βœ… OpenEnv-compatible API (`reset()`, `step()`, `state()`)
- βœ… Task-based evaluation with graders
- βœ… Reproducible seeding
- βœ… Docker containerization
- βœ… `openenv.yaml` metadata
## Contributing
Contributions are welcome! Please follow:
1. Black code formatting (`black .`)
2. Type hints for all functions
3. Docstrings in Google style
4. Unit tests for new features
## License
MIT License - see LICENSE file for details.