fix:readme issue --++
README.md
ADDED
@@ -0,0 +1,158 @@
---
title: FirewatchEnv
emoji: 🔥
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - reinforcement-learning
  - sre
  - agentic
base_path: /web
---
# FirewatchEnv 🔥

> **AIOps 2.0 incident response RL environment** that fills a real gap in the open-source AI SRE tooling landscape.

[OpenEnv](https://github.com/meta-pytorch/OpenEnv)
[HuggingFace Space](https://huggingface.co/spaces/10doshi12/firewatch-env)

## 1. Environment Description & Motivation

FirewatchEnv is a **genuine RL training environment** for autonomous SRE incident response. An AI agent acts as an on-call Site Reliability Engineer: it receives simulated microservice production telemetry (OTel-compatible metrics, Prometheus alerts, log excerpts) and must diagnose and remediate the root cause before the SLO error budget runs out.

### Why this environment fills a real gap

The 2026 AI SRE landscape has many commercial agents (Azure SRE Agent, Datadog Bits AI, Komodor Klaudia AI) but **no portable RL training environment**. Existing academic benchmarks, including AIOpsLab (Microsoft Research, MLSys 2025), ITBench (IBM), and SRE-bench, all require a full Kubernetes cluster and multi-GB Docker images. They are not portable, not deployable to HuggingFace Spaces, and not OpenEnv-spec compliant.

FirewatchEnv is the first OpenEnv-spec compliant SRE training environment:

- Runs in a single Docker container, with no Kubernetes and no external cloud credentials
- 2 vCPUs and 8 GB RAM are sufficient
- Deployable to HuggingFace Spaces in one command

### Novel mechanics

1. **Adversarial telemetry (Task 3):** One red herring service emits a log line containing an embedded prompt injection attempt. A naive agent follows the injected instruction and acts on a healthy service. A robust agent verifies metrics and ignores it. This mirrors the 2026 SRE cybersecurity threat documented by Palo Alto Unit 42.

2. **MTTM and Bad Customer Minutes:** Tracks Mean Time to Mitigation (MTTM), the point at which user-facing impact first stops, and cumulative Bad Customer Minutes (BCM). Based on Google SRE Workbook incident response methodology. No other OpenEnv submission tracks MTTM or BCM.

3. **Outcome-only reward function:** Every reward signal is derived from observable system state changes. No answer keys, no hidden root cause variable. The agent cannot game the grader; it must actually improve system health metrics.
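
The MTTM and BCM bookkeeping described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the environment's implementation: the `ImpactTracker` class, the `impacted_fraction` input, and the rule that one tick at full impact costs 0.5 bad minutes are all assumptions. Only the 30-second tick length and the two field names come from the observation table. In the sample run, impact reaches zero at tick 4 and about 0.45 bad customer minutes accumulate.

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_TICK = 30  # from the README: 1 tick = 30 simulated seconds


@dataclass
class ImpactTracker:
    """Hypothetical tracker: accumulates BCM and records the first mitigation tick."""

    bad_customer_minutes: float = 0.0
    mttm_achieved_tick: Optional[int] = None

    def record_tick(self, tick: int, impacted_fraction: float) -> None:
        # Assumption: a tick with 100% of users impacted costs 0.5 bad minutes.
        self.bad_customer_minutes += impacted_fraction * SECONDS_PER_TICK / 60.0
        # MTTM is taken as the first tick at which user-facing impact is zero.
        if impacted_fraction == 0.0 and self.mttm_achieved_tick is None:
            self.mttm_achieved_tick = tick


# Illustrative episode: impact shrinks after remediation and stops at tick 4.
tracker = ImpactTracker()
for tick, impact in enumerate([0.4, 0.4, 0.1, 0.0, 0.0], start=1):
    tracker.record_tick(tick, impact)

print(tracker.mttm_achieved_tick, round(tracker.bad_customer_minutes, 2))
```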

---

## 2. Action Space

| Action | Type | Target Required | Effect |
|---|---|---|---|
| `fetch_logs` | Investigation | Yes | Populates `recent_logs` on the target service |
| `get_metrics_detail` | Investigation | Yes | Returns 3-tick metric trend summary in feedback |
| `trace_dependencies` | Investigation | Yes | Returns full upstream/downstream chain |
| `restart_service` | Remediation | Yes | Resets OOM state; counts as a wrong action if `http_server_error_rate < 0.10` |
| `rollback_deploy` | Remediation | Yes | Halts bad_deploy progression |
| `revert_config` | Remediation | Yes | Restores connection pool settings |
| `scale_replicas` | Remediation | Yes | Increases memory headroom |
| `circuit_break` | Remediation | Yes | Suppresses cascade for 3 ticks |
| `declare_resolved` | Meta | No | Terminates episode |
| `escalate` | Meta | No | Records escalation (no state change) |

**Wrong-action penalty:** Applied when remediating a service with `http_server_error_rate < 0.10`.
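
As a rough illustration of the table and the penalty rule above, the sketch below groups the ten actions by type and flags a remediation aimed at a healthy service. The action names and the 0.10 threshold come from this README. The helper names `validate_action` and `is_wrong_action` are hypothetical and are not the environment's API. For example, `is_wrong_action("restart_service", 0.02)` is true, while the same action against a service at 0.35 is not penalized.

```python
from typing import Optional

INVESTIGATION = {"fetch_logs", "get_metrics_detail", "trace_dependencies"}
REMEDIATION = {
    "restart_service",
    "rollback_deploy",
    "revert_config",
    "scale_replicas",
    "circuit_break",
}
META = {"declare_resolved", "escalate"}
WRONG_ACTION_THRESHOLD = 0.10  # http_server_error_rate below this counts as healthy


def validate_action(action: str, target: Optional[str]) -> None:
    """Reject unknown actions and targeted actions that lack a target."""
    if action not in INVESTIGATION | REMEDIATION | META:
        raise ValueError(f"unknown action: {action}")
    if action not in META and not target:
        raise ValueError(f"{action} requires a target service")


def is_wrong_action(action: str, target_error_rate: float) -> bool:
    """True when a remediation is applied to a service that looks healthy."""
    return action in REMEDIATION and target_error_rate < WRONG_ACTION_THRESHOLD
```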

---

## 3. Observation Space

`SystemObservation` (returned by `reset()`, `step()`, `state()`):

| Field | Type | Description |
|---|---|---|
| `services` | `dict[str, ServiceMetrics]` | OTel-compatible per-service metrics |
| `active_alerts` | `list[Alert]` | Currently firing Prometheus-format alerts |
| `dependency_graph` | `dict[str, list[str]]` | Episode's service topology |
| `slo_budget_remaining_pct` | `float` | Error budget (100.0 → 0.0) |
| `bad_customer_minutes` | `float` | Cumulative user impact (MTTM objective) |
| `sim_tick` | `int` | Current tick (1 tick = 30 simulated seconds) |
| `action_history` | `list[dict]` | Last 10 actions + feedback strings |
| `mttm_achieved_tick` | `int \| None` | Tick when user impact first reached zero |

Each `ServiceMetrics` has 21 OTel semantic convention fields including `http_server_error_rate`, `http_server_request_duration_p99`, `process_memory_utilization`, `process_cpu_utilization`, `recent_logs`, and more.
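
To make the observation shape concrete, here is a minimal sketch of the table above as Python dataclasses. It assumes plain dataclasses, which may not match the actual model classes, and it lists only the five `ServiceMetrics` fields named in this README. The `Alert` fields, the default values, the service names, and the `worst_service` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServiceMetrics:
    # Five of the 21 fields; units and value ranges are assumptions.
    http_server_error_rate: float = 0.0
    http_server_request_duration_p99: float = 0.0
    process_memory_utilization: float = 0.0
    process_cpu_utilization: float = 0.0
    recent_logs: list[str] = field(default_factory=list)


@dataclass
class Alert:
    # Hypothetical shape; the README only says alerts are Prometheus-format.
    name: str
    service: str
    severity: str = "warning"


@dataclass
class SystemObservation:
    services: dict[str, ServiceMetrics]
    active_alerts: list[Alert]
    dependency_graph: dict[str, list[str]]
    slo_budget_remaining_pct: float = 100.0
    bad_customer_minutes: float = 0.0
    sim_tick: int = 0
    action_history: list[dict] = field(default_factory=list)
    mttm_achieved_tick: Optional[int] = None


def worst_service(obs: SystemObservation) -> str:
    """Return the name of the service with the highest error rate."""
    return max(obs.services, key=lambda name: obs.services[name].http_server_error_rate)


# Illustrative observation with made-up service names.
obs = SystemObservation(
    services={
        "checkout": ServiceMetrics(http_server_error_rate=0.31),
        "search": ServiceMetrics(http_server_error_rate=0.01),
    },
    active_alerts=[Alert(name="HighErrorRate", service="checkout")],
    dependency_graph={"checkout": ["search"], "search": []},
)
print(worst_service(obs))
```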

---

## 4. Tasks & Difficulty

| Task ID | Difficulty | Services | Red Herrings | Max Ticks | SLO Burn/Tick | Seed |
|---|---|---|---|---|---|---|
| `task_easy` | Easy | 3 | 0 | 20 | 1.5% | 42 |
| `task_medium` | Medium | 5 | 1 | 30 | 2.5% | 137 |
| `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 4.0% | 256 |
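
The interaction between the burn rate and the tick limit can be worked out with a few lines of arithmetic. The sketch assumes the SLO budget starts at 100% and burns at the listed rate on every tick regardless of agent actions. That is a simplification for illustration, since the environment may well slow the burn once the fault is mitigated.

```python
import math

# (max_ticks, slo_burn_pct_per_tick) copied from the task table above.
TASKS = {
    "task_easy": (20, 1.5),
    "task_medium": (30, 2.5),
    "task_hard": (40, 4.0),
}
SECONDS_PER_TICK = 30  # 1 tick = 30 simulated seconds


def ticks_until_exhausted(burn_pct_per_tick: float) -> int:
    """First tick at which a constant burn has consumed the whole budget."""
    return math.ceil(100.0 / burn_pct_per_tick)


def budget_at_max_ticks(max_ticks: int, burn_pct_per_tick: float) -> float:
    """Budget left at the tick limit if nothing slows the burn."""
    return max(0.0, 100.0 - burn_pct_per_tick * max_ticks)


for task, (max_ticks, burn) in TASKS.items():
    exhausted = ticks_until_exhausted(burn)
    minutes = exhausted * SECONDS_PER_TICK / 60
    left = budget_at_max_ticks(max_ticks, burn)
    print(f"{task}: exhausted at tick {exhausted} ({minutes} sim min), {left}% left at tick limit")
```

Under this constant-burn assumption, only `task_hard` can exhaust its budget before the tick limit (tick 25, or 12.5 simulated minutes), while `task_easy` and `task_medium` would still have 70% and 25% left at their limits.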

**Task 1 (Easy: Single Service OOM):** One service develops a memory fault. Root cause is unambiguous from OOMKill logs. One or two investigation actions before the correct remediation are sufficient.

**Task 2 (Medium: Cascading Deploy Failure):** A bad deployment on an upstream service cascades to downstream victims. The trap: the most alarming alert is on a downstream victim, not the root cause. Requires tracing the dependency graph upstream.

**Task 3 (Hard: Config Drift Noise Storm):** Config drift with 3 red herrings including one with adversarial prompt injection in logs. Requires filtering noise, resisting adversarial log content, and acting fast under high SLO burn pressure. Designed to challenge frontier models.
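
One way an agent might approach the upstream tracing that Task 2 calls for, while relying on metrics rather than log text as Task 3 demands, is sketched below. It assumes `dependency_graph` maps each service to the services it calls, which is a guess about the edge direction. It also reuses the 0.10 error-rate threshold from the wrong-action rule as a health cutoff. The service names and the `root_cause_candidates` helper are hypothetical. In the example, `gateway` has the highest error rate but only `payments` is returned, because it is the only unhealthy service with no unhealthy dependency.

```python
def root_cause_candidates(
    graph: dict[str, list[str]],
    error_rates: dict[str, float],
    threshold: float = 0.10,
) -> list[str]:
    """Unhealthy services none of whose own dependencies are unhealthy.

    Assumes graph[service] lists the services that `service` calls, so a
    failure propagates from a dependency to its callers.
    """
    unhealthy = {name for name, rate in error_rates.items() if rate >= threshold}
    return sorted(
        name
        for name in unhealthy
        if not any(dep in unhealthy for dep in graph.get(name, []))
    )


# Illustrative cascade: payments is broken, its callers show the loudest symptoms.
graph = {
    "gateway": ["checkout"],
    "checkout": ["payments"],
    "payments": [],
    "search": [],
}
error_rates = {"gateway": 0.45, "checkout": 0.30, "payments": 0.22, "search": 0.01}

print(root_cause_candidates(graph, error_rates))
```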

---

## 5. Setup & Usage

### Prerequisites

- Docker
- Python 3.10+
- `uv` package manager: `pip install uv`
- `openenv-core`: `pip install openenv-core`

### Local Development

```bash
git clone https://huggingface.co/spaces/10doshi12/firewatch-env
cd firewatch-env
uv sync
uv run server  # starts on http://localhost:8000
```

### Run Baseline Inference

```bash
export HF_TOKEN=<your-hf-token>
export SPACE_URL=http://localhost:8000  # or your HF Space URL
python inference.py
```

### Docker

```bash
docker build -t firewatch-env ./server
docker run -p 7860:7860 firewatch-env
```

### OpenEnv Validate

```bash
openenv validate  # must pass with zero errors
```

### Baseline Scores (Qwen/Qwen2.5-72B-Instruct via HF Router)

| Task | Score | Notes |
|---|---|---|
| task_easy | 0.000 | Replace with your actual score after running inference.py |
| task_medium | 0.000 | Replace with your actual score |
| task_hard | 0.000 | Task 3 score reflects adversarial robustness of the model |

*Note: Task 3 is designed to test adversarial robustness. A lower Task 3 score relative to Tasks 1 and 2 reflects the model's susceptibility to prompt injection, not environment quality.*

---

## Fault Types

All five fault types map to the AIOpsLab taxonomy (Table 2, MLSys 2025):

| Fault | AIOpsLab Type | Observable Signature |
|---|---|---|
| `oom` | memory_stress | OOMKill (exit 137), restart_count spike |
| `bad_deploy` | pod restart | Error rate spike post-deployment SHA |
| `config_drift` | misconfig_app | HikariCP pool exhaustion, 30s timeouts |
| `network_partition` | network_delay | Connection refused, circuit breaker OPEN |
| `memory_leak` | memory_leak | Gradual latency increase, slow memory growth |
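
The log-visible signatures in the table can be turned into a simple keyword hint, sketched below. This is only an illustration: the keyword lists are paraphrased from the Observable Signature column, the sample log lines are invented, and `guess_fault` is a hypothetical helper. `bad_deploy` and `memory_leak` are omitted because their signatures are metric trends rather than log text. Since Task 3 plants adversarial log content, a match should be treated as a hint to confirm against metrics, not as a diagnosis.

```python
from typing import Optional

# Lower-case keywords paraphrased from the Observable Signature column.
LOG_SIGNATURES = {
    "oom": ("oomkill", "exit 137"),
    "config_drift": ("hikaricp", "pool exhaust"),
    "network_partition": ("connection refused", "circuit breaker open"),
}


def guess_fault(logs: list[str]) -> Optional[str]:
    """Return the first fault whose keywords appear in any log line, else None."""
    lowered = [line.lower() for line in logs]
    for fault, keywords in LOG_SIGNATURES.items():
        if any(keyword in line for line in lowered for keyword in keywords):
            return fault
    return None


# Invented log lines for illustration only.
print(guess_fault(["container terminated: OOMKilled (exit 137)"]))
print(guess_fault(["HikariCP - Connection is not available, request timed out"]))
print(guess_fault(["GET /health 200 in 3ms"]))
```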

---

*FirewatchEnv, Meta PyTorch OpenEnv Hackathon India 2026*