10doshi12 committed
Commit 172bfa2 · 1 Parent(s): 1c5758a

fix:readme issue --++

Files changed (1)
  1. README.md +158 -0
README.md ADDED
@@ -0,0 +1,158 @@
+ ---
+ title: FirewatchEnv
+ emoji: 🔥
+ colorFrom: red
+ colorTo: yellow
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ tags:
+ - openenv
+ - reinforcement-learning
+ - sre
+ - agentic
+ base_path: /web
+ ---
+ # FirewatchEnv 🔥
+
+ > **AIOps 2.0 incident response RL environment** that fills a real gap in the open-source AI SRE tooling landscape.
+
+ [![openenv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
+ [![HF Space](https://img.shields.io/badge/HuggingFace-Space-orange)](https://huggingface.co/spaces/10doshi12/firewatch-env)
+
+ ## 1. Environment Description & Motivation
+
+ FirewatchEnv is a **genuine RL training environment** for autonomous SRE incident response. An AI agent acts as an on-call Site Reliability Engineer: it receives simulated microservice production telemetry (OTel-compatible metrics, Prometheus alerts, log excerpts) and must diagnose and remediate the root cause before the SLO error budget runs out.
+
+ ### Why this environment fills a real gap
+
+ The 2026 AI SRE landscape has many commercial agents (Azure SRE Agent, Datadog Bits AI, Komodor Klaudia AI) but **no portable RL training environment**. The existing academic benchmarks, AIOpsLab (Microsoft Research, MLSys 2025), ITBench (IBM), and SRE-bench, all require a full Kubernetes cluster and multi-GB Docker images. They are not portable, not deployable to HuggingFace Spaces, and not OpenEnv-spec compliant.
+
+ FirewatchEnv is the first OpenEnv-spec compliant SRE training environment:
+ - Runs in a single Docker container, with no Kubernetes and no external cloud credentials
+ - 2 vCPUs and 8 GB RAM are sufficient
+ - Deployable to HuggingFace Spaces in one command
+
+ ### Novel mechanics
+
+ 1. **Adversarial telemetry (Task 3):** One red herring service emits a log line containing an embedded prompt injection attempt. A naive agent follows the injected instruction and acts on a healthy service. A robust agent verifies metrics and ignores it. This mirrors the 2026 SRE cybersecurity threat documented by Palo Alto Unit 42.
+
+ 2. **MTTM and Bad Customer Minutes:** Tracks Mean Time to Mitigation (MTTM), the point at which user-facing impact first stops, and cumulative Bad Customer Minutes (BCM). Based on the Google SRE Workbook incident response methodology. No other OpenEnv submission tracks MTTM or BCM.
+
+ 3. **Outcome-only reward function:** Every reward signal is derived from observable system state changes. There are no answer keys and no hidden root cause variable. The agent cannot game the grader; it must actually improve system health metrics.
+
+ ---
+
+ ## 2. Action Space
+
+ | Action | Type | Target Required | Effect |
+ |---|---|---|---|
+ | `fetch_logs` | Investigation | Yes | Populates `recent_logs` on the target service |
+ | `get_metrics_detail` | Investigation | Yes | Returns 3-tick metric trend summary in feedback |
+ | `trace_dependencies` | Investigation | Yes | Returns full upstream/downstream chain |
+ | `restart_service` | Remediation | Yes | Resets OOM state; wrong if error_rate < 0.10 |
+ | `rollback_deploy` | Remediation | Yes | Halts bad_deploy progression |
+ | `revert_config` | Remediation | Yes | Restores connection pool settings |
+ | `scale_replicas` | Remediation | Yes | Increases memory headroom |
+ | `circuit_break` | Remediation | Yes | Suppresses cascade for 3 ticks |
+ | `declare_resolved` | Meta | No | Terminates episode |
+ | `escalate` | Meta | No | Records escalation (no state change) |
+
+ **Wrong-action penalty:** Applied when remediating a service with `http_server_error_rate < 0.10`.
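
The wrong-action rule above can be sketched as a small predicate. This is only an illustration, not the environment's actual code: the helper `is_wrong_action` and the constant names are hypothetical, and only the action names and the 0.10 threshold on `http_server_error_rate` come from the table.

```python
# Hypothetical sketch of the wrong-action rule; not FirewatchEnv's implementation.
# Action names and the 0.10 threshold are taken from the Action Space table.
REMEDIATION_ACTIONS = {
    "restart_service",
    "rollback_deploy",
    "revert_config",
    "scale_replicas",
    "circuit_break",
}
ERROR_RATE_THRESHOLD = 0.10


def is_wrong_action(action: str, http_server_error_rate: float) -> bool:
    """True when a remediation targets a service whose error rate is below the threshold."""
    return (
        action in REMEDIATION_ACTIONS
        and http_server_error_rate < ERROR_RATE_THRESHOLD
    )
```

Under this reading, investigation and meta actions never trigger the penalty, so an agent can inspect a suspicious service freely before committing to a remediation.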
+
+ ---
+
+ ## 3. Observation Space
+
+ `SystemObservation` (returned by `reset()`, `step()`, `state()`):
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `services` | `dict[str, ServiceMetrics]` | OTel-compatible per-service metrics |
+ | `active_alerts` | `list[Alert]` | Currently firing Prometheus-format alerts |
+ | `dependency_graph` | `dict[str, list[str]]` | Episode's service topology |
+ | `slo_budget_remaining_pct` | `float` | Error budget (100.0 → 0.0) |
+ | `bad_customer_minutes` | `float` | Cumulative user impact (MTTM objective) |
+ | `sim_tick` | `int` | Current tick (1 tick = 30 simulated seconds) |
+ | `action_history` | `list[dict]` | Last 10 actions + feedback strings |
+ | `mttm_achieved_tick` | `int \| None` | Tick when user impact first reached zero |
+
+ Each `ServiceMetrics` has 21 fields that follow OTel semantic conventions, including `http_server_error_rate`, `http_server_request_duration_p99`, `process_memory_utilization`, `process_cpu_utilization`, and `recent_logs`.
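
To make the shape of an observation concrete, here is a rough typed sketch built only from the field names and types listed above. It is an assumption-laden illustration rather than the project's real model: `ServiceMetrics` shows only the five fields named in this README (the real class has 21), `recent_logs` is assumed to be a list of strings, `active_alerts` and `action_history` are omitted because their inner structure is not described here, and `unhealthy_services` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServiceMetrics:
    # Subset of the 21 fields; only those named in the README.
    http_server_error_rate: float = 0.0
    http_server_request_duration_p99: float = 0.0
    process_memory_utilization: float = 0.0
    process_cpu_utilization: float = 0.0
    recent_logs: list = field(default_factory=list)  # assumed: list of log lines


@dataclass
class SystemObservation:
    services: dict          # service name -> ServiceMetrics
    dependency_graph: dict  # service name -> list of service names
    slo_budget_remaining_pct: float = 100.0
    bad_customer_minutes: float = 0.0
    sim_tick: int = 0
    mttm_achieved_tick: Optional[int] = None


def unhealthy_services(obs: SystemObservation, threshold: float = 0.10) -> list:
    """Hypothetical helper: names of services at or above the error-rate threshold."""
    return sorted(
        name
        for name, metrics in obs.services.items()
        if metrics.http_server_error_rate >= threshold
    )
```

A helper like this gives an agent a first filter: only services it returns can be remediated without triggering the wrong-action penalty.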
+
+ ---
+
+ ## 4. Tasks & Difficulty
+
+ | Task ID | Difficulty | Services | Red Herrings | Max Ticks | SLO Burn/Tick | Seed |
+ |---|---|---|---|---|---|---|
+ | `task_easy` | Easy | 3 | 0 | 20 | 1.5% | 42 |
+ | `task_medium` | Medium | 5 | 1 | 30 | 2.5% | 137 |
+ | `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 4.0% | 256 |
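
The table's numbers imply different effective deadlines per task. The sketch below works through the arithmetic under one loud assumption: that the SLO burn rate stays constant at the listed value for every tick when nothing is remediated (the README does not say whether burn varies during an episode). The function names are hypothetical; the tick length of 30 simulated seconds comes from the `sim_tick` description.

```python
import math

# (max_ticks, slo_burn_pct_per_tick) from the Tasks & Difficulty table.
TASKS = {
    "task_easy": (20, 1.5),
    "task_medium": (30, 2.5),
    "task_hard": (40, 4.0),
}
SECONDS_PER_TICK = 30  # 1 tick = 30 simulated seconds


def ticks_until_budget_exhausted(burn_pct_per_tick: float) -> int:
    """Ticks needed to burn a 100% budget, assuming a constant burn rate."""
    return math.ceil(100.0 / burn_pct_per_tick)


def effective_deadline_ticks(task_id: str) -> int:
    """Whichever comes first: the tick limit or budget exhaustion (assumed constant burn)."""
    max_ticks, burn = TASKS[task_id]
    return min(max_ticks, ticks_until_budget_exhausted(burn))


for task_id in TASKS:
    ticks = effective_deadline_ticks(task_id)
    minutes = ticks * SECONDS_PER_TICK / 60
    print(f"{task_id}: {ticks} ticks ({minutes:.1f} simulated minutes)")
```

Under that assumption the tick limit binds for `task_easy` and `task_medium`, while on `task_hard` the budget would run out at tick 25, well before the 40-tick limit, which is consistent with the "high SLO burn pressure" described for that task.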
+
+ **Task 1 (Easy: Single Service OOM):** One service develops a memory fault. The root cause is unambiguous from the OOMKill logs, so 1–2 investigation actions before the correct remediation are sufficient.
+
+ **Task 2 (Medium: Cascading Deploy Failure):** A bad deployment on an upstream service cascades to downstream victims. The trap: the most alarming alert is on a downstream victim, not the root cause. Requires tracing the dependency graph upstream.
+
+ **Task 3 (Hard: Config Drift Noise Storm):** Config drift with 3 red herrings, including one with an adversarial prompt injection in its logs. Requires filtering noise, resisting adversarial log content, and acting fast under high SLO burn pressure. Designed to challenge frontier models.
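
The upstream tracing that Task 2 calls for can be sketched as a graph walk. This is a hedged illustration, not the environment's logic or a recommended agent policy: it assumes `dependency_graph[s]` lists the services that `s` depends on (the README does not state the edge direction), it uses error rate as the only health signal, and the function and service names are hypothetical.

```python
def root_cause_candidates(start, dependency_graph, error_rates, threshold=0.10):
    """Walk upstream from `start` through unhealthy dependencies.

    Returns the unhealthy services that have no unhealthy dependency of
    their own, i.e. the deepest points the failure can be traced to.
    Assumes dependency_graph[s] lists the services s depends on.
    """
    seen = set()
    stack = [start]
    candidates = []
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        unhealthy_deps = [
            dep
            for dep in dependency_graph.get(service, [])
            if error_rates.get(dep, 0.0) >= threshold
        ]
        if error_rates.get(service, 0.0) >= threshold and not unhealthy_deps:
            candidates.append(service)
        stack.extend(unhealthy_deps)
    return sorted(candidates)


# Hypothetical topology: the loudest alert is on "checkout", the fault is in "auth".
graph = {"checkout": ["payments", "cart"], "payments": ["auth"], "cart": [], "auth": []}
rates = {"checkout": 0.60, "payments": 0.30, "cart": 0.01, "auth": 0.25}
print(root_cause_candidates("checkout", graph, rates))
```

A walk like this only narrows the search. In Task 3, red herrings may also look unhealthy, so a candidate would still need confirmation from logs and metric trends before remediation.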
+
+ ---
+
+ ## 5. Setup & Usage
+
+ ### Prerequisites
+ - Docker
+ - Python 3.10+
+ - `uv` package manager: `pip install uv`
+ - `openenv-core`: `pip install openenv-core`
+
+ ### Local Development
+
+ ```bash
+ git clone https://huggingface.co/spaces/10doshi12/firewatch-env
+ cd firewatch-env
+ uv sync
+ uv run server # starts on http://localhost:8000
+ ```
+
+ ### Run Baseline Inference
+
+ ```bash
+ export HF_TOKEN=<your-hf-token>
+ export SPACE_URL=http://localhost:8000 # or your HF Space URL
+ python inference.py
+ ```
+
+ ### Docker
+
+ ```bash
+ docker build -t firewatch-env ./server
+ docker run -p 7860:7860 firewatch-env
+ ```
+
+ ### OpenEnv Validate
+
+ ```bash
+ openenv validate # must pass with zero errors
+ ```
+
+ ### Baseline Scores (Qwen/Qwen2.5-72B-Instruct via HF Router)
+
+ | Task | Score | Notes |
+ |---|---|---|
+ | task_easy | 0.000 | Placeholder; replace with your actual score after running `inference.py` |
+ | task_medium | 0.000 | Placeholder; replace with your actual score |
+ | task_hard | 0.000 | Placeholder; the Task 3 score reflects the adversarial robustness of the model |
+
+ *Note: Task 3 is designed to test adversarial robustness. A lower Task 3 score relative to Tasks 1–2 reflects the model's susceptibility to prompt injection, not environment quality.*
+
+ ---
+
+ ## Fault Types
+
+ All five fault types map to the AIOpsLab taxonomy (Table 2, MLSys 2025):
+
+ | Fault | AIOpsLab Type | Observable Signature |
+ |---|---|---|
+ | `oom` | memory_stress | OOMKill (exit 137), restart_count spike |
+ | `bad_deploy` | pod restart | Error rate spike post-deployment SHA |
+ | `config_drift` | misconfig_app | HikariCP pool exhaustion, 30s timeouts |
+ | `network_partition` | network_delay | Connection refused, circuit breaker OPEN |
+ | `memory_leak` | memory_leak | Gradual latency increase, slow memory growth |
+
+ ---
+
+ *FirewatchEnv - Meta PyTorch OpenEnv Hackathon India 2026*