---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: "latest"
python_version: "3.11"
pinned: false
app_port: 7860
---
# Adaptive Alert Triage & Incident Response Environment (OpenEnv)

**Version**: 0.1.0
**Framework**: OpenEnv
**Status**: Alpha

## Overview

An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.
### Why RL Over Rule-Based Systems?

| **Challenge**               | **Rule-Based Limitation**                                  | **RL Advantage**                                       |
| --------------------------- | ---------------------------------------------------------- | ------------------------------------------------------ |
| **Dynamic Patterns**        | Static thresholds fail as alert patterns evolve            | Learns from feedback, adapts to changing distributions |
| **Context Awareness**       | Cannot capture alert correlations or temporal dependencies | Discovers hidden relationships through experience      |
| **Resource Optimization**   | Fixed allocation ignores varying system states             | Optimizes action selection under real-time constraints |
| **False Positive Handling** | Uniform treatment leads to alert fatigue                   | Learns nuanced confidence signals and noise patterns   |
| **Cascading Failures**      | Reactive approach misses early warning signs               | Proactive detection through predictive state modeling  |
## Environment Specification

### State Space (Partial Observability)

**Visible Features:**

- `alerts`: List of active alerts, each with:
  - `id`: Unique alert identifier
  - `visible_severity`: Noisy severity score (0.0-1.0)
  - `confidence`: Detection confidence (0.0-1.0)
  - `alert_type`: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
  - `age`: Time steps since alert generation
- `system_load`: Current system resource utilization (0.0-1.0)
- `queue_length`: Number of unprocessed alerts
- `time_remaining`: Steps left in episode

**Hidden Features** (ground truth for reward computation):

- `true_severity`: Actual criticality of each alert
- `correlations`: Alert dependency graph
- `future_failures`: Predicted cascading failure probabilities
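As a concrete illustration, the visible portion of the state can be sketched with stdlib dataclasses. The real environment ships Pydantic models in `adaptive_alert_triage.models`; the field names below follow the list above, but the class shapes are assumptions for illustration only:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Alert:
    id: str                 # unique alert identifier
    visible_severity: float  # noisy severity score, 0.0-1.0
    confidence: float        # detection confidence, 0.0-1.0
    alert_type: str          # CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY
    age: int                 # time steps since the alert was generated

@dataclass
class Observation:
    alerts: List[Alert]
    system_load: float       # current resource utilization, 0.0-1.0
    queue_length: int        # unprocessed alerts
    time_remaining: int      # steps left in the episode

# Build a one-alert observation to show the shape the agent sees
obs = Observation(
    alerts=[Alert("A1", 0.9, 0.8, "CPU", 2)],
    system_load=0.4,
    queue_length=1,
    time_remaining=30,
)
print(obs.alerts[0].id)
```

Note that none of the hidden features (`true_severity`, `correlations`, `future_failures`) appear here: the agent only ever receives the visible fields.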
### Action Space

Per alert, the agent can execute one of:

- **INVESTIGATE**: Allocate resources to diagnose (costly, but resolves critical issues)
- **IGNORE**: Mark as noise (efficient for false positives)
- **ESCALATE**: Route to a specialist team (for high-confidence critical alerts)
- **DELAY**: Defer to the next time step (queue management)

**Resource Constraints**: Maximum of K investigations per time step (task-dependent).
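To make the K-investigation budget concrete, here is a minimal sketch of a greedy policy that investigates the top-K alerts and delays the rest. Ranking by `visible_severity * confidence` is an illustrative heuristic, not the environment's scoring rule:

```python
def select_actions(alerts, k):
    """Greedy triage under a budget of k investigations per step.

    alerts: list of (alert_id, visible_severity, confidence) tuples.
    Returns a dict mapping alert_id -> action string.
    """
    # Rank alerts by a simple severity-times-confidence heuristic (assumption)
    ranked = sorted(alerts, key=lambda a: a[1] * a[2], reverse=True)
    actions = {}
    for i, (alert_id, severity, confidence) in enumerate(ranked):
        # Spend the investigation budget on the top-k, defer everything else
        actions[alert_id] = "INVESTIGATE" if i < k else "DELAY"
    return actions

acts = select_actions([("A1", 0.9, 0.8), ("A2", 0.3, 0.9), ("A3", 0.7, 0.6)], k=1)
print(acts)  # A1 is investigated; A2 and A3 are delayed
```

A learned policy would replace the hand-written ranking, but any policy must respect the same per-step budget.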
### Reward Structure

```python
+10  # Critical alert correctly investigated
 +5  # Cascading failure prevented through correlation detection
 +3  # False positive correctly ignored
 -2  # Unnecessary investigation (resource waste)
 -8  # Missed critical alert
-10  # System failure due to ignored critical issue
```
### Episode Dynamics

- **Length**: 20-50 time steps (task-dependent)
- **Termination**: Max steps reached OR failure threshold exceeded
- **Alert Generation**: Continuous stochastic process with temporal correlation
- **Failure Mechanics**: Ignored critical alerts accumulate damage, eventually triggering cascading failures
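The failure mechanic can be sketched as a damage counter that grows with each ignored critical alert and terminates the episode once it crosses a threshold. The rates and threshold below are illustrative assumptions; the real environment's constants are task-dependent:

```python
FAILURE_THRESHOLD = 3.0              # illustrative threshold (assumption)
DAMAGE_PER_IGNORED_CRITICAL = 1.0    # illustrative damage rate (assumption)

def update_damage(damage, ignored_critical_count):
    """Accumulate damage for critical alerts ignored this step."""
    return damage + ignored_critical_count * DAMAGE_PER_IGNORED_CRITICAL

damage = 0.0
# Number of critical alerts ignored at each successive time step
for ignored in [1, 0, 2, 1]:
    damage = update_damage(damage, ignored)
    if damage >= FAILURE_THRESHOLD:
        # Cascading failure: the episode would terminate early here
        print(f"cascading failure at damage={damage}")
        break
```

The point of the mechanic is that ignoring criticals is not a one-off penalty: damage compounds, so a policy that defers everything eventually loses the episode.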
## Tasks

### 1. Easy: Basic Alert Prioritization

**Objective**: Correctly classify and handle alerts based on visible signals.
**Success Criteria**: ≥70% correct action rate
**Key Challenge**: Distinguish genuine critical alerts from noise
**Grading**: `correct_actions / total_actions`

### 2. Medium: Resource-Constrained Triage

**Objective**: Optimize triage under strict investigation limits.
**Success Criteria**: ≥65% weighted efficiency score
**Key Challenge**: Maximize critical alert resolution with limited resources
**Grading**: `weighted_resolved_alerts * resource_efficiency`

### 3. Hard: Cascading Failure Prevention

**Objective**: Detect correlated alerts and prevent future failures.
**Success Criteria**: ≥60% score with stability requirements
**Key Challenge**: Infer hidden correlations and predict failure chains
**Grading**: `(prevented_failures - system_instability_penalty) / max_possible`
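The easy-task grading formula above is a plain accuracy ratio. A minimal sketch, assuming actions and ground-truth labels are both represented as `alert_id -> action` dicts (a representation choice for illustration, not the grader's actual API):

```python
def grade_easy(actions, ground_truth):
    """Fraction of alerts where the chosen action matches the correct one.

    actions, ground_truth: dicts mapping alert_id -> action string.
    """
    if not actions:
        return 0.0
    correct = sum(
        1 for alert_id, act in actions.items()
        if ground_truth.get(alert_id) == act
    )
    return correct / len(actions)

score = grade_easy(
    {"A1": "INVESTIGATE", "A2": "IGNORE", "A3": "INVESTIGATE"},
    {"A1": "INVESTIGATE", "A2": "IGNORE", "A3": "IGNORE"},
)
print(score)  # 2 of 3 actions correct
```

The medium and hard graders weight this by resource efficiency and failure prevention respectively, as the formulas above show.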
## Installation

### Local Setup

```bash
# Clone the repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install the package in editable mode
pip install -e .
```

### Docker Setup

```bash
# Build the Docker image
docker build -t adaptive-alert-triage:latest .

# Run validation
docker run --rm adaptive-alert-triage:latest

# Run evaluation with an OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py
```
## Usage

### Quick Start

```python
from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action

# Initialize the environment with the easy task
env = AdaptiveAlertTriageEnv(task_id="easy")

# Reset the environment
observation = env.reset()

# Run an episode
done = False
total_reward = 0
info = {}
while not done:
    # Example: investigate the first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE",
    )
    observation, reward, done, info = env.step(action)
    total_reward += reward.value

print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
```
### Running Baseline Agents

```bash
# Rule-based baseline
python agents/baseline.py --task easy

# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium
```

### Evaluation

```bash
# Run all baselines on all tasks
python evaluation/evaluate.py

# Generate comparison plots
python evaluation/plots.py
```

## Testing

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/

# Run a specific test file
pytest tests/test_env.py -v
```
## Docker + RL Server

The environment includes a production-ready FastAPI server for remote RL training.

### Architecture

```
External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
                                                            │
                                                            │ Internal: AdaptiveAlertTriageEnv
                                                            │ (real + synthetic alerts)
                                                            │
External RL Trainer (SB3) ──/env/reset────────────────────> │ <──/env/step(action)── Obs/Reward/Done
                                                            │
                                                            ▼
                                           RL beats baselines! (0.61 → 0.82+)
```
### Quick Start

```bash
# 1. Build and run the persistent RL server
docker compose up --build -d

# 2. Verify server health
curl http://localhost:8000/health

# 3. Send real alerts (simulate a Datadog webhook)
bash scripts/demo_webhook.sh

# 4. Train an external RL agent
pip install stable-baselines3
python train_external.py

# 5. View metrics
curl http://localhost:8000/metrics
```
### API Endpoints

| Endpoint               | Method | Description                             |
| ---------------------- | ------ | --------------------------------------- |
| `/health`              | GET    | Health check (env_ready, queue_size)    |
| `/metrics`             | GET    | RL score vs. baseline comparison        |
| `/ingest/alerts`       | POST   | Webhook receiver for Datadog/Kafka      |
| `/env/reset/{task_id}` | POST   | Initialize episode (easy/medium/hard)   |
| `/env/step`            | POST   | Take RL action, receive obs/reward/done |
| `/env/state`           | GET    | Debug: current episode state            |
| `/tasks`               | GET    | List available tasks                    |
| `/ws/train`            | WS     | Real-time streaming RL loop             |
```
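The `/env/reset/{task_id}` and `/env/step` endpoints can also be driven over plain HTTP. The sketch below uses only the standard library; the request payload mirrors the WebSocket example, and the field names and response shape are assumptions rather than a confirmed schema:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # server address from the quick start above

def post(path, payload=None):
    """Minimal JSON POST helper using only the standard library."""
    data = json.dumps(payload or {}).encode()
    req = urllib.request.Request(
        BASE + path, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def make_step_payload(alert_id, action_type):
    # Mirrors the WebSocket step message; field names are assumptions
    return {"action": {"alert_id": alert_id, "action_type": action_type}}

def run_episode(task_id="easy"):
    """Drive one episode against a live server (requires the Docker stack up)."""
    post(f"/env/reset/{task_id}")
    result = {"done": False}
    while not result.get("done", False):
        result = post("/env/step", make_step_payload("A1", "INVESTIGATE"))
    return result
```

Calling `run_episode()` assumes the server from `docker compose up` is reachable on port 8000; for streaming training the WebSocket endpoint below avoids per-step HTTP overhead.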
### WebSocket Training

```python
import asyncio
import json

import websockets

async def train():
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset the episode
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = json.loads(await ws.recv())

        # Step loop
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"},
            }))
            result = json.loads(await ws.recv())
            if result["done"]:
                break

asyncio.run(train())
```
---

## Project Structure

```
adaptive_alert_triage_openenv/
├── README.md                     # This file
├── pyproject.toml                # Project metadata and dependencies
├── openenv.yaml                  # OpenEnv specification
├── Dockerfile                    # Container build instructions
├── requirements.txt              # Python dependencies
│
├── src/adaptive_alert_triage/    # Core environment implementation
│   ├── __init__.py
│   ├── env.py                    # Main Gym environment
│   ├── models.py                 # Pydantic Observation/Action/Reward models
│   └── utils.py                  # Helper functions
│
├── tasks/                        # Task definitions and graders
│   ├── easy.py                   # Basic prioritization
│   ├── medium.py                 # Resource-constrained triage
│   └── hard.py                   # Cascading failure prevention
│
├── rewards/                      # Reward shaping logic
│   └── reward.py
│
├── agents/                       # Baseline and example agents
│   ├── baseline.py               # Rule-based threshold agent
│   └── inference.py              # OpenAI API baseline
│
├── tests/                        # Unit and integration tests
│   ├── test_env.py
│   ├── test_tasks.py
│   └── test_rewards.py
│
├── evaluation/                   # Performance analysis
│   ├── evaluate.py               # Run benchmarks
│   └── plots.py                  # Generate comparison charts
│
└── docker/                       # Docker utilities
    └── entrypoint.sh             # Container startup script
```
## OpenEnv Compliance

This environment adheres to the OpenEnv specification:

- ✅ Pydantic models for Observation, Action, and Reward
- ✅ OpenEnv-compatible API (`reset()`, `step()`, `state()`)
- ✅ Task-based evaluation with graders
- ✅ Reproducible seeding
- ✅ Docker containerization
- ✅ `openenv.yaml` metadata

## Contributing

Contributions are welcome! Please follow these conventions:

1. Black code formatting (`black .`)
2. Type hints for all functions
3. Docstrings in Google style
4. Unit tests for new features

## License

MIT License - see the LICENSE file for details.