---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: latest
python_version: '3.11'
pinned: false
app_port: 7860
---

Adaptive Alert Triage & Incident Response Environment (OpenEnv)

Version: 0.1.0
Framework: OpenEnv
Status: Alpha

Overview

An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.

Why RL Over Rule-Based Systems?

Each point contrasts the rule-based limitation with the RL advantage:

  • Dynamic Patterns: static thresholds fail as alert patterns evolve; RL learns from feedback and adapts to changing distributions.
  • Context Awareness: rules cannot capture alert correlations or temporal dependencies; RL discovers hidden relationships through experience.
  • Resource Optimization: fixed allocation ignores varying system states; RL optimizes action selection under real-time constraints.
  • False Positive Handling: uniform treatment leads to alert fatigue; RL learns nuanced confidence signals and noise patterns.
  • Cascading Failures: a reactive approach misses early warning signs; RL detects them proactively through predictive state modeling.

Environment Specification

State Space (Partial Observability)

Visible Features:

  • alerts: List of active alerts with:
    • id: Unique alert identifier
    • visible_severity: Noisy severity score (0.0-1.0)
    • confidence: Detection confidence (0.0-1.0)
    • alert_type: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
    • age: Time steps since alert generation
  • system_load: Current system resource utilization (0.0-1.0)
  • queue_length: Number of unprocessed alerts
  • time_remaining: Steps left in episode

Hidden Features (ground truth for reward computation):

  • true_severity: Actual criticality of each alert
  • correlations: Alert dependency graph
  • future_failures: Predicted cascading failure probabilities
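
For illustration, a single observation might look roughly like this once serialized (field names follow the description above; the Pydantic models in src/adaptive_alert_triage/models.py are the source of truth and may differ in detail):

# Illustrative observation snapshot; values are made up.
observation = {
    "alerts": [
        {"id": "A1", "visible_severity": 0.82, "confidence": 0.91,
         "alert_type": "CPU", "age": 3},
        {"id": "A2", "visible_severity": 0.35, "confidence": 0.40,
         "alert_type": "NETWORK", "age": 1},
    ],
    "system_load": 0.67,
    "queue_length": 2,
    "time_remaining": 27,
}
# true_severity, correlations, and future_failures are never exposed here;
# they are only used internally to compute rewards.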

Action Space

For each alert, the agent can execute one of the following actions:

  • INVESTIGATE: Allocate resources to diagnose (costly but resolves critical issues)
  • IGNORE: Mark as noise (efficient for false positives)
  • ESCALATE: Route to specialist team (high-confidence critical alerts)
  • DELAY: Defer to next time step (queue management)

Resource Constraints: Maximum K investigations per time step (task-dependent).
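
As a minimal sketch of how a policy might respect that budget, assuming step() takes one Action per call (as in the Quick Start below) and using an illustrative limit of 3 investigations per step:

from adaptive_alert_triage.models import Action

def choose_action(observation, investigations_used, max_investigations=3):
    """Toy heuristic: spend the investigation budget on the most severe-looking
    alert, otherwise defer the oldest alert to the next time step."""
    target = max(observation.alerts, key=lambda a: a.visible_severity)
    if investigations_used < max_investigations and target.visible_severity > 0.5:
        return Action(alert_id=target.id, action_type="INVESTIGATE")
    oldest = max(observation.alerts, key=lambda a: a.age)
    return Action(alert_id=oldest.id, action_type="DELAY")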

Reward Structure

+10  # Critical alert correctly investigated
+5   # Cascading failure prevented through correlation detection
+3   # False positive correctly ignored
-2   # Unnecessary investigation (resource waste)
-8   # Missed critical alert
-10  # System failure due to ignored critical issue
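
A minimal sketch of this mapping as code (the outcome labels are hypothetical; the shipped shaping logic in rewards/reward.py may combine or scale these terms differently):

# Outcome-to-reward mapping mirroring the table above.
REWARDS = {
    "critical_investigated": +10,
    "cascade_prevented": +5,
    "false_positive_ignored": +3,
    "unnecessary_investigation": -2,
    "missed_critical": -8,
    "system_failure": -10,
}

def step_reward(outcomes):
    """Sum the rewards for all outcome labels observed in one time step."""
    return sum(REWARDS[o] for o in outcomes)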

Episode Dynamics

  • Length: 20-50 time steps (task-dependent)
  • Termination: Max steps reached OR failure threshold exceeded
  • Alert Generation: Continuous stochastic process with temporal correlation
  • Failure Mechanics: Ignored critical alerts accumulate damage, triggering cascading failures

Tasks

1. Easy: Basic Alert Prioritization

Objective: Correctly classify and handle alerts based on visible signals.
Success Criteria: ≥70% correct action rate
Key Challenge: Distinguish genuine critical alerts from noise
Grading: correct_actions / total_actions

2. Medium: Resource-Constrained Triage

Objective: Optimize triage under strict investigation limits.
Success Criteria: ≥65% weighted efficiency score
Key Challenge: Maximize critical alert resolution with limited resources
Grading: (weighted_resolved_alerts * resource_efficiency)

3. Hard: Cascading Failure Prevention

Objective: Detect correlated alerts and prevent future failures.
Success Criteria: ≥60% score with stability requirements
Key Challenge: Infer hidden correlations and predict failure chains
Grading: (prevented_failures - system_instability_penalty) / max_possible
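
The grading formulas above translate into simple functions; this is only a sketch with hypothetical counter names, not the graders shipped in tasks/:

def grade_easy(correct_actions, total_actions):
    # Pass threshold: >= 0.70
    return correct_actions / max(total_actions, 1)

def grade_medium(weighted_resolved_alerts, resource_efficiency):
    # Pass threshold: >= 0.65; resource_efficiency assumed to lie in [0, 1]
    return weighted_resolved_alerts * resource_efficiency

def grade_hard(prevented_failures, system_instability_penalty, max_possible):
    # Pass threshold: >= 0.60
    return (prevented_failures - system_instability_penalty) / max(max_possible, 1)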

Installation

Local Setup

# Clone repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in editable mode
pip install -e .

Docker Setup

# Build Docker image
docker build -t adaptive-alert-triage:latest .

# Run validation
docker run --rm adaptive-alert-triage:latest

# Run evaluation with OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py

Usage

Quick Start

from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action

# Initialize environment with easy task
env = AdaptiveAlertTriageEnv(task_id="easy")

# Reset environment
observation = env.reset()

# Run episode
done = False
total_reward = 0

while not done:
    # Example: investigate first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE"
    )

    observation, reward, done, info = env.step(action)
    total_reward += reward.value

print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
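
Single-episode scores are noisy; a quick way to get a steadier estimate is to repeat the loop above over several episodes and average info['task_score'] (this keeps the same naive always-investigate policy and, like the Quick Start, assumes at least one active alert per step):

scores = []
for _ in range(10):
    observation = env.reset()
    done = False
    while not done:
        action = Action(alert_id=observation.alerts[0].id, action_type="INVESTIGATE")
        observation, reward, done, info = env.step(action)
    scores.append(info["task_score"])

print(f"Mean task score over 10 episodes: {sum(scores) / len(scores):.3f}")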

Running Baseline Agents

# Rule-based baseline
python agents/baseline.py --task easy

# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium

Evaluation

# Run all baselines on all tasks
python evaluation/evaluate.py

# Generate comparison plots
python evaluation/plots.py

Testing

# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/

# Run specific test file
pytest tests/test_env.py -v

Docker + RL Server

The environment includes a production-ready FastAPI server for remote RL training.

Architecture

External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
                                                        │
                                                        │ Internal: AdaptiveAlertTriageEnv
                                                        │ (real + synthetic alerts)
                                                        ↓
External RL Trainer (SB3)      ──/env/reset───────────> │ <──/env/step(action)── Obs/Reward/Done
                                                        │
                                                        ↓
                                                  RL beats baselines! (0.61 → 0.82+)

Quick Start

# 1. Build and run the persistent RL server
docker compose up --build -d

# 2. Verify server health
curl http://localhost:8000/health

# 3. Send real alerts (simulate Datadog webhook)
bash scripts/demo_webhook.sh

# 4. Train external RL agent
pip install stable-baselines3
python train_external.py

# 5. View metrics
curl http://localhost:8000/metrics

API Endpoints

  • GET /health: health check (env_ready, queue_size)
  • GET /metrics: RL score vs. baseline comparison
  • POST /ingest/alerts: webhook receiver for Datadog/Kafka
  • POST /env/reset/{task_id}: initialize an episode (easy/medium/hard)
  • POST /env/step: take an RL action, receive obs/reward/done
  • GET /env/state: debug view of the current episode state
  • GET /tasks: list available tasks
  • WS /ws/train: real-time streaming RL loop
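
The same episode loop can also be driven over plain HTTP using /env/reset/{task_id} and /env/step. The sketch below assumes the JSON responses mirror the Python Quick Start (an observation with an alerts list, plus reward/done fields); check the server code for the exact schema:

import requests

BASE = "http://localhost:8000"

# Start an episode on the easy task.
obs = requests.post(f"{BASE}/env/reset/easy").json()

done = False
total_reward = 0.0
while not done:
    # Naive policy: always investigate the first active alert.
    action = {"alert_id": obs["alerts"][0]["id"], "action_type": "INVESTIGATE"}
    result = requests.post(f"{BASE}/env/step", json=action).json()
    obs = result["observation"]
    total_reward += result["reward"]
    done = result["done"]

print("Episode reward:", total_reward)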

WebSocket Training

import asyncio
import json

import websockets

async def train():
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset the episode on the hard task
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = json.loads(await ws.recv())

        # Step loop: send an action, read back obs/reward/done
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"}
            }))
            result = json.loads(await ws.recv())
            if result["done"]:
                break

asyncio.run(train())

Project Structure

adaptive_alert_triage_openenv/
├── README.md                   # This file
├── pyproject.toml              # Project metadata and dependencies
├── openenv.yaml                # OpenEnv specification
├── Dockerfile                  # Container build instructions
├── requirements.txt            # Python dependencies
│
├── src/adaptive_alert_triage/  # Core environment implementation
│   ├── __init__.py
│   ├── env.py                  # Main Gym environment
│   ├── models.py               # Pydantic Observation/Action/Reward models
│   └── utils.py                # Helper functions
│
├── tasks/                      # Task definitions and graders
│   ├── easy.py                 # Basic prioritization
│   ├── medium.py               # Resource-constrained triage
│   └── hard.py                 # Cascading failure prevention
│
├── rewards/                    # Reward shaping logic
│   └── reward.py
│
├── agents/                     # Baseline and example agents
│   ├── baseline.py             # Rule-based threshold agent
│   └── inference.py            # OpenAI API baseline
│
├── tests/                      # Unit and integration tests
│   ├── test_env.py
│   ├── test_tasks.py
│   └── test_rewards.py
│
├── evaluation/                 # Performance analysis
│   ├── evaluate.py             # Run benchmarks
│   └── plots.py                # Generate comparison charts
│
└── docker/                     # Docker utilities
    └── entrypoint.sh           # Container startup script

OpenEnv Compliance

This environment adheres to the OpenEnv specification:

  • ✅ Pydantic models for Observation, Action, and Reward
  • ✅ OpenEnv-compatible API (reset(), step(), state())
  • ✅ Task-based evaluation with graders
  • ✅ Reproducible seeding
  • ✅ Docker containerization
  • ✅ openenv.yaml metadata

Contributing

Contributions are welcome! Please follow:

  1. Black code formatting (black .)
  2. Type hints for all functions
  3. Docstrings in Google style
  4. Unit tests for new features
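
For example, a new helper would look roughly like this (illustrative function, not part of the codebase):

def severity_bucket(visible_severity: float) -> str:
    """Map a noisy severity score to a coarse triage bucket.

    Args:
        visible_severity: Noisy severity in the range [0.0, 1.0].

    Returns:
        One of "low", "medium", or "high".
    """
    if visible_severity >= 0.7:
        return "high"
    return "medium" if visible_severity >= 0.4 else "low"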

License

MIT License - see LICENSE file for details.