# πŸ”₯ FirewatchEnv β€” Quickstart Guide

> Get from zero to running your first AI SRE agent in under 5 minutes.

---

## What is FirewatchEnv?

FirewatchEnv is an **RL training environment** for autonomous SRE incident response, built for the [Meta PyTorch OpenEnv Hackathon India 2026](https://github.com/meta-pytorch/OpenEnv). Your AI agent acts as an on-call Site Reliability Engineer β€” it receives simulated microservice telemetry (OTel-compatible metrics, Prometheus-style alerts, log excerpts) and must **diagnose and remediate the root cause** before the SLO error budget runs out.

**Key highlights:**
- Single container, no Kubernetes β€” runs on 2 vCPUs / 8 GB RAM
- Three difficulty tiers (Easy β†’ Medium β†’ Hard) with adversarial prompt injection in Task 3
- Outcome-only reward function β€” the agent can't game the grader; it must actually fix the system

---

## Prerequisites

| Tool | Version | Install |
|------|---------|---------|
| **Python** | 3.10+ | [python.org](https://www.python.org/downloads/) |
| **uv** | latest | `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| **Git** | any | [git-scm.com](https://git-scm.com/) |
| **Docker** | latest *(optional β€” only for containerized runs)* | [docker.com](https://docs.docker.com/get-docker/) |

---

## 1 β€” Clone & Install

```bash
git clone https://huggingface.co/spaces/10doshi12/firewatch-env
cd firewatch-env
```

> **Important:** All commands below should be run from inside the `firewatch_env/` directory, which contains the actual environment code.

```bash
cd firewatch_env
uv sync            # installs all Python dependencies from pyproject.toml + uv.lock
```

This installs:
- `openenv-core[core]` β‰₯ 0.2.2 β€” FastAPI server + HTTP client types
- `pydantic` β‰₯ 2.0 β€” data models
- `openai` β‰₯ 1.0 β€” LLM inference via OpenAI-compatible API
- `python-dotenv` β€” `.env` file loading

---

## 2 β€” Configure Environment Variables

Copy the example and fill in your credentials:

```bash
cp .env.example .env
```

Edit `.env`:

```dotenv
# --- LLM Provider (HuggingFace Router) ---
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
HF_TOKEN=hf_your_huggingface_token_here

# --- Server URL (usually auto-detected β€” leave commented for local dev) ---
# SPACE_URL=https://10doshi12-firewatch-env.hf.space
```

Get your HF token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (requires a **Pro** or **Enterprise** plan for router access to gated models).

| Variable | Required | Description |
|----------|----------|-------------|
| `API_BASE_URL` | Yes | HuggingFace Router endpoint (`https://router.huggingface.co/v1`) |
| `MODEL_NAME` | Yes | Model on HF Hub (e.g. `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`) |
| `HF_TOKEN` | No* | HuggingFace API token. *If omitted, inference runs a deterministic rule-based fallback agent (no LLM calls).* |
| `SPACE_URL` | No | Override server URL. Auto-detected in order: `localhost:8000` β†’ `localhost:7860` β†’ HF Space |
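The fallback behavior in the table above can be read as a small resolution routine. The sketch below is illustrative only — `resolve_config` and its defaults are assumptions, not the actual client code:

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv, installed by `uv sync`
    load_dotenv()                   # pulls .env into the process environment
except ImportError:
    pass  # fall back to whatever is already exported in the shell

def resolve_config():
    """Collect provider settings, applying the fallbacks described above."""
    cfg = {
        "api_base_url": os.getenv("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct"),
        "hf_token": os.getenv("HF_TOKEN"),    # None -> rule-based fallback agent
        "space_url": os.getenv("SPACE_URL"),  # None -> auto-detection
    }
    cfg["use_llm"] = cfg["hf_token"] is not None
    return cfg
```

The key point is that a missing `HF_TOKEN` is not an error — it simply switches inference to the deterministic rule-based agent.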

---

## 3 β€” Start the Server

```bash
uv run server
```

The FastAPI server starts on **http://localhost:8000** with these endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment β€” `{"difficulty": "easy", "seed": 42}` |
| `/step` | POST | Execute action β€” `{"action": {"action_type": "fetch_logs", "target_service": "auth-service"}}` |
| `/state` | GET | Get current environment state |
| `/schema` | GET | Action / observation JSON schemas |
| `/ws` | WS | WebSocket for persistent sessions |

### Quick smoke test (new terminal):

```bash
# Reset an easy episode
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "easy", "seed": 42}'

# Take an action
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "fetch_logs", "target_service": "cache"}}'

# Check current state
curl http://localhost:8000/state
```
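The same smoke test can be driven from Python with only the standard library. The `build_request` / `post_json` wrappers here are illustrative helpers, not part of the repo:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def build_request(path, payload):
    """Assemble a JSON POST request for a FirewatchEnv endpoint."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def post_json(path, payload):
    """Send the request and decode the server's JSON reply."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(post_json("/reset", {"difficulty": "easy", "seed": 42}))
    print(post_json("/step", {"action": {"action_type": "fetch_logs",
                                         "target_service": "cache"}}))
```

Run it with the server up in another terminal; the printed dicts mirror the curl responses above.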

---

## 4 β€” Run the Inference Agent

With the server running in one terminal, open a **second terminal**:

```bash
cd firewatch_env
python inference.py
```

This runs your agent across all three tasks sequentially:

| Task | Difficulty | Services | Red Herrings | Max Ticks | Seed |
|------|-----------|----------|-------------|-----------|------|
| `task_easy` | Easy | 3 | 0 | 20 | 42 |
| `task_medium` | Medium | 5 | 1 | 30 | 137 |
| `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 256 |

### Expected Output

```
[START] task=task_easy env=firewatch-env model=Qwen/Qwen2.5-7B-Instruct
[STEP] step=1 action=fetch_logs:cache reward=-0.14 done=false error=null
[STEP] step=2 action=rollback_deploy:cache reward=-0.14 done=false error=null
...
[END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86
```

Each `[STEP]` line shows the action taken, intermediate reward, and whether the episode ended. The `[END]` line reports the final graded score (0.0–1.0).
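The transcript above comes from a loop of this general shape. The sketch below is a simplification — the `env`/`policy` interfaces and step signature are assumptions, not the exact code in `inference.py`:

```python
def run_episode(env, policy, task="task_easy", max_steps=20):
    """Reset, act until the episode ends, and collect per-step rewards."""
    obs = env.reset(task)
    print(f"[START] task={task} env=firewatch-env")
    rewards = []
    for step in range(1, max_steps + 1):
        action = policy(obs)                  # pick the next action from the observation
        obs, reward, done = env.step(action)  # simplified step signature
        rewards.append(reward)
        print(f"[STEP] step={step} action={action} "
              f"reward={reward:.2f} done={str(done).lower()}")
        if done:
            break
    print(f"[END] steps={len(rewards)} "
          f"rewards={','.join(f'{r:.2f}' for r in rewards)}")
    return rewards
```

Swapping `policy` between an LLM call and the rule-based fallback leaves the loop itself unchanged.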

---

## 5 β€” Docker (Alternative)

Build and run the environment as a Docker container:

```bash
# From the firewatch_env/ directory
docker build -t firewatch-env ./server
docker run -p 7860:7860 firewatch-env
```

The server will be available at **http://localhost:7860**. Set `SPACE_URL=http://localhost:7860` when running `inference.py` (or let auto-detection find it).

---

## 6 β€” Deploy to HuggingFace Spaces

```bash
openenv validate          # must pass with zero errors
openenv push --repo-id 10doshi12/firewatch-env
```

Your environment will be live at `https://10doshi12-firewatch-env.hf.space`.

---

## Project Structure

```
firewatch_env/
β”œβ”€β”€ models.py              # Pydantic models (FirewatchAction, SystemObservation, etc.)
β”œβ”€β”€ simulation.py          # ServiceMesh + generate_episode() + fault physics
β”œβ”€β”€ actions.py             # ActionHandler β€” all 17 action types
β”œβ”€β”€ rewards.py             # RewardEngine + grade() + EpisodeResult
β”œβ”€β”€ config.py              # Constants, TASKS dict, topology (pure data)
β”œβ”€β”€ client.py              # OpenEnv-generated WebSocket client
β”œβ”€β”€ inference.py           # LLM agent loop (stdout eval format)
β”œβ”€β”€ openenv.yaml           # OpenEnv spec definition
β”œβ”€β”€ .env.example           # Environment variable template
β”œβ”€β”€ Dockerfile             # Multi-stage Docker build
β”œβ”€β”€ pyproject.toml         # Dependencies & entry points
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py             # FastAPI application (entry point)
β”‚   └── firewatch_env_environment.py  # Environment wiring
└── tests/
    β”œβ”€β”€ test_integration.py
    β”œβ”€β”€ test_simulation.py
    └── test_inference.py
```

---

## Action Space Reference

### Investigation Actions (read-only)

| Action | Description |
|--------|-------------|
| `fetch_logs` | Populates `recent_logs` on the target service |
| `get_metrics_detail` | Returns 3-tick metric trend summary |
| `trace_dependencies` | Returns full upstream/downstream dependency chain |
| `strace_process` | System-call level process inspection |
| `profiler_dump` | CPU/memory profiler output |
| `check_gc_pressure` | GC pause times and heap pressure |
| `trace_distributed_request` | End-to-end distributed trace |
| `inspect_thread_pool` | Thread pool utilization and deadlock detection |
| `inspect_commit_diff` | Recent deployment diff |

### Remediation Actions (mutate state)

| Action | Description |
|--------|-------------|
| `restart_service` | Resets OOM state; graded as a wrong action if `error_rate < 0.10` |
| `rollback_deploy` | Halts bad deployment progression |
| `revert_config` | Restores connection pool / config settings |
| `scale_replicas` | Increases memory headroom |
| `circuit_break` | Suppresses cascade for 3 ticks |
| `traffic_shift` | Redirects traffic away from degraded service |

### Meta Actions

| Action | Description |
|--------|-------------|
| `declare_resolved` | Terminates episode and triggers grading |
| `escalate` | Records escalation (no state change) |

---

## Fault Types

| Fault | Signal in Logs | Correct Remediation |
|-------|---------------|---------------------|
| `oom` | OOMKilled, exit code 137 | `restart_service` |
| `bad_deploy` | Error spike post-deployment SHA | `rollback_deploy` |
| `config_drift` | HikariCP pool exhaustion, 30s timeouts | `revert_config` |
| `network_partition` | Connection refused, circuit breaker OPEN | `circuit_break` or `restart_service` |
| `memory_leak` | Gradual latency increase, slow memory growth | `scale_replicas` β†’ `restart_service` |
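The table above is effectively a lookup from log signature to remediation, which is roughly what the rule-based fallback agent does. A minimal triage sketch (the signature strings and mappings here are illustrative, taken from the table, not from `inference.py`):

```python
# Fault type -> documented remediation (first action for memory_leak).
FAULT_REMEDIATION = {
    "oom": "restart_service",
    "bad_deploy": "rollback_deploy",
    "config_drift": "revert_config",
    "network_partition": "circuit_break",
    "memory_leak": "scale_replicas",
}

# Log substring -> fault type, per the "Signal in Logs" column.
LOG_SIGNATURES = {
    "OOMKilled": "oom",
    "exit code 137": "oom",
    "pool exhaustion": "config_drift",
    "Connection refused": "network_partition",
    "circuit breaker OPEN": "network_partition",
}

def suggest_remediation(log_line):
    """Return the remediation suggested by a log excerpt, or None."""
    for signature, fault in LOG_SIGNATURES.items():
        if signature in log_line:
            return FAULT_REMEDIATION[fault]
    return None
```

A real agent would confirm with investigation actions (`fetch_logs`, `inspect_commit_diff`) before acting, since red herrings can plant misleading signals.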

---

## Scoring

The grader produces a score between **0.0 and 1.0** based on four components:

| Component | Weight | What it Measures |
|-----------|--------|-----------------|
| Recovery | 40% | Did system health improve? |
| Speed | 25% | How quickly was the incident mitigated (mean time to mitigation, MTTM)? |
| Precision | 20% | Were wrong actions avoided? |
| SLO | 15% | How much error budget remained? |

---

## Running Tests

```bash
cd firewatch_env
uv run pytest tests/                                  # all tests
uv run pytest tests/test_integration.py               # integration only
uv run pytest tests/test_simulation.py                # simulation logic
uv run pytest tests/test_integration.py::test_reset_deterministic  # single test
```

---

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `uv: command not found` | Install uv: `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| `openenv-core` import error | Run `uv sync` inside `firewatch_env/` |
| Server won't start | Check port 8000 isn't in use: `lsof -i :8000` |
| `inference.py` can't find server | Server auto-detection probes `localhost:8000` β†’ `localhost:7860`. Ensure the server is running. |
| LLM API errors / 401 | Verify `HF_TOKEN` in `.env`. Without it, the rule-based fallback agent runs (no LLM). |
| Score is 0.0 | Agent didn't call `declare_resolved` or SLO budget hit 0%. Check action logs. |
| Docker build fails | Ensure Docker Desktop is running. Build from `firewatch_env/`: `docker build -t fw ./server` |

---

## Next Steps

- **Swap the model**: Change `MODEL_NAME` in `.env` to test different HF-hosted models (e.g. `Qwen/Qwen2.5-72B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`)
- **Tune the agent**: Edit `SYSTEM_PROMPT` and `_recovery_hint()` in `inference.py` to improve decision-making
- **Add actions**: Extend `actions.py` with new diagnostic or remediation actions
- **Custom tasks**: Define new scenarios in `config.py` and `openenv.yaml`
- **Benchmark**: Compare scores across models to find the best SRE agent

---

*FirewatchEnv β€” Meta PyTorch OpenEnv Hackathon India 2026*