---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: latest
python_version: '3.11'
pinned: false
app_port: 7860
---

Adaptive Alert Triage & Incident Response Environment (OpenEnv)

Version: 0.1.0
Framework: OpenEnv
Status: Alpha

Overview

An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.

Why RL Over Rule-Based Systems?

Each point contrasts the rule-based limitation with the RL advantage:

  • Dynamic Patterns: static thresholds fail as alert patterns evolve; RL learns from feedback and adapts to changing distributions.
  • Context Awareness: rules cannot capture alert correlations or temporal dependencies; RL discovers hidden relationships through experience.
  • Resource Optimization: fixed allocation ignores varying system states; RL optimizes action selection under real-time constraints.
  • False Positive Handling: uniform treatment leads to alert fatigue; RL learns nuanced confidence signals and noise patterns.
  • Cascading Failures: a reactive approach misses early warning signs; RL detects them proactively through predictive state modeling.

Environment Specification

State Space (Partial Observability)

Visible Features:

  • alerts: List of active alerts with:
    • id: Unique alert identifier
    • visible_severity: Noisy severity score (0.0-1.0)
    • confidence: Detection confidence (0.0-1.0)
    • alert_type: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
    • age: Time steps since alert generation
  • system_load: Current system resource utilization (0.0-1.0)
  • queue_length: Number of unprocessed alerts
  • time_remaining: Steps left in episode

Hidden Features (ground truth for reward computation):

  • true_severity: Actual criticality of each alert
  • correlations: Alert dependency graph
  • future_failures: Predicted cascading failure probabilities
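
For illustration, a single observation might look roughly like this once serialized (field names follow the description above; the Pydantic models in src/adaptive_alert_triage/models.py are the source of truth and may differ in detail):

# Illustrative observation snapshot; values are made up.
observation = {
    "alerts": [
        {"id": "A1", "visible_severity": 0.82, "confidence": 0.91,
         "alert_type": "CPU", "age": 3},
        {"id": "A2", "visible_severity": 0.35, "confidence": 0.40,
         "alert_type": "NETWORK", "age": 1},
    ],
    "system_load": 0.67,
    "queue_length": 2,
    "time_remaining": 27,
}
# true_severity, correlations, and future_failures are never exposed here;
# they are only used internally to compute rewards.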

Action Space

For each alert, the agent can execute one of the following actions:

  • INVESTIGATE: Allocate resources to diagnose (costly but resolves critical issues)
  • IGNORE: Mark as noise (efficient for false positives)
  • ESCALATE: Route to specialist team (high-confidence critical alerts)
  • DELAY: Defer to next time step (queue management)

Resource Constraints: Maximum K investigations per time step (task-dependent).
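
As a minimal sketch of how a policy might respect that budget, assuming step() takes one Action per call (as in the Quick Start below) and using an illustrative limit of 3 investigations per step:

from adaptive_alert_triage.models import Action

def choose_action(observation, investigations_used, max_investigations=3):
    """Toy heuristic: spend the investigation budget on the most severe-looking
    alert, otherwise defer the oldest alert to the next time step."""
    target = max(observation.alerts, key=lambda a: a.visible_severity)
    if investigations_used < max_investigations and target.visible_severity > 0.5:
        return Action(alert_id=target.id, action_type="INVESTIGATE")
    oldest = max(observation.alerts, key=lambda a: a.age)
    return Action(alert_id=oldest.id, action_type="DELAY")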

Reward Structure

+10  # Critical alert correctly investigated
+5   # Cascading failure prevented through correlation detection
+3   # False positive correctly ignored
-2   # Unnecessary investigation (resource waste)
-8   # Missed critical alert
-10  # System failure due to ignored critical issue
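
A minimal sketch of this mapping as code (the outcome labels are hypothetical; the shipped shaping logic in rewards/reward.py may combine or scale these terms differently):

# Outcome-to-reward mapping mirroring the table above.
REWARDS = {
    "critical_investigated": +10,
    "cascade_prevented": +5,
    "false_positive_ignored": +3,
    "unnecessary_investigation": -2,
    "missed_critical": -8,
    "system_failure": -10,
}

def step_reward(outcomes):
    """Sum the rewards for all outcome labels observed in one time step."""
    return sum(REWARDS[o] for o in outcomes)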

Episode Dynamics

  • Length: 20-50 time steps (task-dependent)
  • Termination: Max steps reached OR failure threshold exceeded
  • Alert Generation: Continuous stochastic process with temporal correlation
  • Failure Mechanics: Ignored critical alerts accumulate damage, triggering cascading failures

Tasks

1. Easy: Basic Alert Prioritization

Objective: Correctly classify and handle alerts based on visible signals.
Success Criteria: ≥70% correct action rate
Key Challenge: Distinguish genuine critical alerts from noise
Grading: correct_actions / total_actions

2. Medium: Resource-Constrained Triage

Objective: Optimize triage under strict investigation limits.
Success Criteria: ≥65% weighted efficiency score
Key Challenge: Maximize critical alert resolution with limited resources
Grading: (weighted_resolved_alerts * resource_efficiency)

3. Hard: Cascading Failure Prevention

Objective: Detect correlated alerts and prevent future failures.
Success Criteria: ≥60% score with stability requirements
Key Challenge: Infer hidden correlations and predict failure chains
Grading: (prevented_failures - system_instability_penalty) / max_possible
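
The grading formulas above translate into simple functions; this is only a sketch with hypothetical counter names, not the graders shipped in tasks/:

def grade_easy(correct_actions, total_actions):
    # Pass threshold: >= 0.70
    return correct_actions / max(total_actions, 1)

def grade_medium(weighted_resolved_alerts, resource_efficiency):
    # Pass threshold: >= 0.65; resource_efficiency assumed to lie in [0, 1]
    return weighted_resolved_alerts * resource_efficiency

def grade_hard(prevented_failures, system_instability_penalty, max_possible):
    # Pass threshold: >= 0.60
    return (prevented_failures - system_instability_penalty) / max(max_possible, 1)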

Installation

Local Setup

# Clone repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in editable mode
pip install -e .

Docker Setup

# Build Docker image
docker build -t adaptive-alert-triage:latest .

# Run validation
docker run --rm adaptive-alert-triage:latest

# Run evaluation with OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py

Usage

Quick Start

from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action

# Initialize environment with easy task
env = AdaptiveAlertTriageEnv(task_id="easy")

# Reset environment
observation = env.reset()

# Run episode
done = False
total_reward = 0

while not done:
    # Example: investigate first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE"
    )

    observation, reward, done, info = env.step(action)
    total_reward += reward.value

print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
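
Single-episode scores are noisy; a quick way to get a steadier estimate is to repeat the loop above over several episodes and average info['task_score'] (this keeps the same naive always-investigate policy and, like the Quick Start, assumes at least one active alert per step):

scores = []
for _ in range(10):
    observation = env.reset()
    done = False
    while not done:
        action = Action(alert_id=observation.alerts[0].id, action_type="INVESTIGATE")
        observation, reward, done, info = env.step(action)
    scores.append(info["task_score"])

print(f"Mean task score over 10 episodes: {sum(scores) / len(scores):.3f}")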

Running Baseline Agents

# Rule-based baseline
python agents/baseline.py --task easy

# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium

Evaluation

# Run all baselines on all tasks
python evaluation/evaluate.py

# Generate comparison plots
python evaluation/plots.py

Testing

# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/

# Run specific test file
pytest tests/test_env.py -v

Docker + RL Server

The environment includes a production-ready FastAPI server for remote RL training.

Architecture

External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
                                                        │
                                                        │ Internal: AdaptiveAlertTriageEnv
                                                        │ (real + synthetic alerts)
                                                        ↓
External RL Trainer (SB3)      ──/env/reset───────────> │ <──/env/step(action)── Obs/Reward/Done
                                                        │
                                                        ↓
                                                  RL beats baselines! (0.61 → 0.82+)

Quick Start

# 1. Build and run the persistent RL server
docker compose up --build -d

# 2. Verify server health
curl http://localhost:8000/health

# 3. Send real alerts (simulate Datadog webhook)
bash scripts/demo_webhook.sh

# 4. Train external RL agent
pip install stable-baselines3
python train_external.py

# 5. View metrics
curl http://localhost:8000/metrics

API Endpoints

  • GET /health: health check (env_ready, queue_size)
  • GET /metrics: RL score vs. baseline comparison
  • POST /ingest/alerts: webhook receiver for Datadog/Kafka
  • POST /env/reset/{task_id}: initialize an episode (easy/medium/hard)
  • POST /env/step: take an RL action, receive obs/reward/done
  • GET /env/state: debug view of the current episode state
  • GET /tasks: list available tasks
  • WS /ws/train: real-time streaming RL loop
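
The same episode loop can also be driven over plain HTTP using /env/reset/{task_id} and /env/step. The sketch below assumes the JSON responses mirror the Python Quick Start (an observation with an alerts list, plus reward/done fields); check the server code for the exact schema:

import requests

BASE = "http://localhost:8000"

# Start an episode on the easy task.
obs = requests.post(f"{BASE}/env/reset/easy").json()

done = False
total_reward = 0.0
while not done:
    # Naive policy: always investigate the first active alert.
    action = {"alert_id": obs["alerts"][0]["id"], "action_type": "INVESTIGATE"}
    result = requests.post(f"{BASE}/env/step", json=action).json()
    obs = result["observation"]
    total_reward += result["reward"]
    done = result["done"]

print("Episode reward:", total_reward)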

WebSocket Training

import asyncio
import json

import websockets

async def train():
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset the episode on the hard task
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = json.loads(await ws.recv())

        # Step loop: send an action, read back obs/reward/done
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"}
            }))
            result = json.loads(await ws.recv())
            if result["done"]:
                break

asyncio.run(train())

Project Structure

adaptive_alert_triage_openenv/
├── README.md                   # This file
├── pyproject.toml              # Project metadata and dependencies
├── openenv.yaml                # OpenEnv specification
├── Dockerfile                  # Container build instructions
├── requirements.txt            # Python dependencies
│
├── src/adaptive_alert_triage/  # Core environment implementation
│   ├── __init__.py
│   ├── env.py                  # Main Gym environment
│   ├── models.py               # Pydantic Observation/Action/Reward models
│   └── utils.py                # Helper functions
│
├── tasks/                      # Task definitions and graders
│   ├── easy.py                 # Basic prioritization
│   ├── medium.py               # Resource-constrained triage
│   └── hard.py                 # Cascading failure prevention
│
├── rewards/                    # Reward shaping logic
│   └── reward.py
│
├── agents/                     # Baseline and example agents
│   ├── baseline.py             # Rule-based threshold agent
│   └── inference.py            # OpenAI API baseline
│
├── tests/                      # Unit and integration tests
│   ├── test_env.py
│   ├── test_tasks.py
│   └── test_rewards.py
│
├── evaluation/                 # Performance analysis
│   ├── evaluate.py             # Run benchmarks
│   └── plots.py                # Generate comparison charts
│
└── docker/                     # Docker utilities
    └── entrypoint.sh           # Container startup script

OpenEnv Compliance

This environment adheres to the OpenEnv specification:

  • ✅ Pydantic models for Observation, Action, and Reward
  • ✅ OpenEnv-compatible API (reset(), step(), state())
  • ✅ Task-based evaluation with graders
  • ✅ Reproducible seeding
  • ✅ Docker containerization
  • ✅ openenv.yaml metadata

Contributing

Contributions are welcome! Please follow:

  1. Black code formatting (black .)
  2. Type hints for all functions
  3. Docstrings in Google style
  4. Unit tests for new features
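
For example, a new helper would look roughly like this (illustrative function, not part of the codebase):

def severity_bucket(visible_severity: float) -> str:
    """Map a noisy severity score to a coarse triage bucket.

    Args:
        visible_severity: Noisy severity in the range [0.0, 1.0].

    Returns:
        One of "low", "medium", or "high".
    """
    if visible_severity >= 0.7:
        return "high"
    return "medium" if visible_severity >= 0.4 else "low"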

License

MIT License - see LICENSE file for details.