Eshit committed d377d79 (verified) · 1 parent: 356f98f

Update README

Files changed (1)
  1. README.md +364 -361
README.md CHANGED
@@ -1,361 +1,364 @@
1
- ---
2
- title: Wildfire Containment Simulator
3
- emoji: 🔥
4
- colorFrom: red
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- license: mit
9
- tags:
10
- - reinforcement-learning
11
- - simulation
12
- - openenv
13
- - wildfire
14
- - rl-environment
15
- - long-horizon
16
- - instruction-following
17
- ---
18
-
19
- # Wildfire Containment Simulator
20
-
21
- **Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**
22
-
23
- ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
24
- ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
25
- ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
26
- ![Python](https://img.shields.io/badge/Python-3.11+-blue)
27
- ![License](https://img.shields.io/badge/License-MIT-green)
28
-
29
- A partially observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
30
-
31
- > **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **{TBD}** on Hard tier, vs. **+4.74** for the rule-based heuristic and **+2.16** for the random baseline.
32
- > *(Numbers are filled in after `scripts/eval_trained_model.py` completes; see [Results](#results).)*
33
-
34
- ---
35
-
36
- ## 🔗 Quick Links
37
-
38
- | Resource | Link |
39
- |---|---|
40
- | 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
41
- | 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
42
- | 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
43
- | 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
44
- | 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
45
- | 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
46
- | 📈 **Training dashboard** | [`training/training_dashboard.png`](training/training_dashboard.png) *(generated post-run)* |
47
- | 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
48
- | 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
49
-
50
- ---
51
-
52
- ## Why Theme 2
53
-
54
- | Pillar | How we model it |
55
- |---|---|
56
- | **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
57
- | **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
58
- | **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
59
-
60
- ---
61
-
62
- ## Real-World Motivation
63
-
64
- Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.
65
-
66
- For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
67
-
68
- ---
69
-
70
- ## Quickstart
71
-
72
- ```bash
73
- # Clone and install
74
- git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
75
- cd Wildfire-Containment-Simulator
76
- uv pip install -r requirements.txt
77
- uv pip install -e .
78
-
79
- # Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
80
- python scripts/evaluate.py 5
81
-
82
- # Compare agents head-to-head
83
- python scripts/eval_compare.py --seeds 42 43 44 45 46 \
84
- --tiers easy medium hard --agents random heuristic
85
-
86
- # Render an episode as a GIF
87
- python scripts/replay.py --tier medium --seed 42 \
88
- --agent heuristic --output demos/replay.gif
89
-
90
- # Spin up the OpenEnv FastAPI server locally on port 7860
91
- python server/app.py
92
- # Then visit http://localhost:7860/ui/ for the interactive frontend
93
- ```
94
-
95
- Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
96
-
97
- ---
98
-
99
- ## Live Hugging Face Space
100
-
101
- The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
102
-
103
- ```bash
104
- SPACE=https://eshit-wildfire-containment-simulator.hf.space
105
-
106
- curl "$SPACE/health"
107
- curl -X POST "$SPACE/reset?task_id=easy&seed=42"
108
- curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
109
- -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
110
- ```
111
-
112
- Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
113
-
114
- ---
115
-
116
- ## Environment API
117
-
118
- ```python
119
- from env import WildfireEnv, Action, ActionType, Direction
120
-
121
- env = WildfireEnv()
122
- obs = env.reset(task_id="easy", seed=42) # Observation (with OperationalBriefing on first step)
123
-
124
- while not env.done:
125
-     action = Action(
126
-         action_type=ActionType.DEPLOY_CREW,
127
-         crew_id="crew_0",
128
-         target_row=7, target_col=7,
129
-     )
130
-     result = env.step(action)  # StepResult
131
-     obs = result.observation
132
-     reward = result.reward  # decomposed float, range ~−8 to +8
133
-     done = result.done
134
-
135
- state = env.state() # Full ground truth (grading only)
136
- ```
137
-
138
- `reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders; agents must work from `Observation`.
139
-
140
- ---
141
-
142
- ## Action Space
143
-
144
- All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
145
-
146
- | Action | Required parameters | Description |
147
- |---|---|---|
148
- | `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
149
- | `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
150
- | `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
151
- | `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
152
- | `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
153
- | `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
154
- | `idle` | `reason` *(optional)* | Explicitly wait |
155
-
156
- A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
157
-
158
- ---
159
-
160
- ## Observation Space
161
-
162
- | Component | Contents | Noise / occlusion |
163
- |---|---|---|
164
- | `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
165
- | `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
166
- | `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
167
- | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
168
- | `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
169
- | `recent_events` | Last 5 notable events | Fully observable |
170
-
171
- The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
172
-
173
- ---
174
-
175
- ## Reward Function
176
-
177
- Decomposed for GRPO: a wide reward range produces meaningful advantages between rollout groups.
178
-
179
- **Per-step (dense):**
180
- ```
181
- step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
182
- ```
183
-
184
- **Terminal (sparse, on episode end):**
185
- ```
186
- +5.0    if all populations safe
187
- +0–2.0  efficiency bonus (faster containment ⇒ more)
188
- +1.0    briefing-adherence bonus (all priority zones survived)
189
- −3.0 · (pop_lost / total_pop)  if any population lost
190
- −2.0    if any crew casualty
191
- −0.01 × invalid_action_count   capped at −0.2
192
- ```
193
-
194
- Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
195
-
196
- | Tier | Spread scale | Episode length | Approx. reward ceiling |
197
- |---|---|---|---|
198
- | Easy | 1.00× | 80 | +8 |
199
- | Medium | 0.70× | 150 | +7 |
200
- | Hard | 0.55× | 300 | +6 |
201
-
202
- ---
203
-
204
- ## Three Difficulty Tiers
205
-
206
- ### Task 1 (Easy): Flatland Grass Fire
207
- 15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
208
-
209
- ### Task 2 (Medium): Canyon Terrain with Wind Shifts
210
- 25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
211
-
212
- ### Task 3 (Hard): Wildland-Urban Interface Crisis
213
- 40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
214
-
215
- ---
216
-
217
- ## Fire Spread Model
218
-
219
- A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
220
-
221
- ```
222
- P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
223
-             × (1 − moisture) × (1 − suppression) × tier_scale
224
- ```
225
-
226
- | Factor | Effect |
227
- |---|---|
228
- | `base_rate` | Baseline spread by fuel type |
229
- | `fuel_factor` | Fuel load of the target cell |
230
- | `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
231
- | `slope_factor` | Faster uphill, slower downhill |
232
- | `moisture` | Wet ground / recent rain reduces ignition probability |
233
- | `suppression` | Crew presence and retardant coverage reduce spread |
234
- | `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
235
-
236
- Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
237
-
238
- ---
239
-
240
- ## Results
241
-
242
- > Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15` on held-out seeds 200–214 (no overlap with training seeds 0–99).
243
-
244
- | Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
245
- |---|---|---|---|
246
- | Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
247
- | Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | +4.74 ± 3.79 |
248
- | **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
249
- | **Δ vs. Heuristic** | **{TBD}** | **{TBD}** | **{TBD}** |
250
-
251
- **Auxiliary metrics for the trained agent** (filled in post-eval):
252
-
253
- | Metric | Easy | Medium | Hard |
254
- |---|---|---|---|
255
- | JSON success rate | {TBD} | {TBD} | {TBD} |
256
- | Mean population saved % | {TBD} | {TBD} | {TBD} |
257
- | Crew casualty rate | {TBD} | {TBD} | {TBD} |
258
-
259
- > See `scripts/trained_results.json` (post-eval) for the raw scores.
260
-
261
- ---
262
-
263
- ## Training
264
-
265
- We use a two-stage recipe:
266
-
267
- 1. **SFT warm-up**: generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
268
- 2. **GRPO (TRL `GRPOTrainer`)**: start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
269
-
270
- **Hardware:** A10G Large (24 GB) on a Hugging Face Space JupyterLab session.
271
- **Training stack:** `unsloth` (4-bit QLoRA), `trl==0.15.2`, `datasets==3.4.1`, `transformers`, `peft`, `wandb`. Pinned in [`training/requirements.txt`](training/requirements.txt).
272
-
273
- **Training plots:** dashboard PNG at [`training/training_dashboard.png`](training/training_dashboard.png) (4-panel: episode reward, population-survival rate, containment %, curriculum tier timeline). W&B run: *(link added post-run)*.
274
-
275
- For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
276
-
277
- ---
278
-
279
- ## Project Structure
280
-
281
- ```text
282
- Wildfire-Containment-Simulator/
283
- ├── env/
284
- │   ├── wildfire_env.py        # Main env: reset(), step(), state()
285
- │   ├── models.py              # Pydantic action/observation/state models
286
- │   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
287
- │   ├── fire_spread.py         # Cellular automaton fire propagation
288
- │   ├── weather.py             # Stochastic weather engine
289
- │   ├── resources.py           # Crews, tankers, firebreaks, recon
290
- │   ├── reward.py              # Decomposed step + terminal reward
291
- │   ├── briefing.py            # OperationalBriefing generation
292
- │   ├── serialization.py       # Observation → LLM prompt
293
- │   ├── action_parser.py       # LLM output → Action (3-layer fallback)
294
- │   ├── rendering.py           # Frame rendering for GIF replays
295
- │   └── curriculum.py          # CurriculumController (auto-promote/demote)
296
- ├── agents/
297
- │   ├── random_agent.py
298
- │   └── heuristic_agent.py
299
- ├── graders/
300
- │   ├── grader_easy.py         # → (total_reward, details_dict)
301
- │   ├── grader_medium.py
302
- │   └── grader_hard.py
303
- ├── scripts/
304
- │   ├── evaluate.py            # Baseline eval (random + heuristic)
305
- │   ├── eval_compare.py        # Multi-agent comparison
306
- │   ├── eval_trained_model.py  # Evaluate a trained adapter
307
- │   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
308
- │   ├── replay.py              # Render episode as GIF
309
- │   ├── run_demo.py            # Pitch demo
310
- │   └── plot_dashboard.py      # 4-panel training curves
311
- ├── training/
312
- │   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
313
- │   ├── sft_colab.ipynb        # SFT warm-up notebook
314
- │   ├── sft_data.jsonl         # 4,300 SFT examples
315
- │   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
316
- │   └── README.md
317
- ├── server/
318
- │   └── app.py                 # FastAPI on port 7860
319
- ├── frontend/                  # Interactive HTML/JS frontend served at /ui/
320
- ├── tests/                     # 41 pytest tests
321
- ├── demos/                     # GIF/PNG demo assets
322
- ├── openenv.yaml               # OpenEnv environment manifest
323
- ├── Dockerfile                 # HF Space build
324
- ├── BLOG.md                    # Long-form write-up
325
- └── README.md                  # You are here
326
- ```
327
-
328
- ---
329
-
330
- ## Architecture Decisions
331
-
332
- 1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
333
- 2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
334
- 3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
335
- 4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
336
- 5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
337
- 6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
338
- 7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
339
-
340
- ---
341
-
342
- ## Citation
343
-
344
- If you use this environment, please cite:
345
-
346
- ```bibtex
347
- @misc{wildfire-containment-simulator-2026,
348
- title = {Wildfire Containment Simulator: Long-Horizon Planning and
349
- Instruction Following for Disaster-Response LLM Agents},
350
- author = {Team Wildfire},
351
- year = {2026},
352
- url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
353
- note = {Meta OpenEnv Hackathon submission, Theme 2}
354
- }
355
- ```
356
-
357
- ---
358
-
359
- ## License
360
-
361
- [MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.
 
 
 
 
1
+ ---
2
+ title: Wildfire Containment Simulator
3
+ emoji: 🔥
4
+ colorFrom: red
5
+ colorTo: purple
6
+ sdk: docker
7
+ pinned: false
8
+ license: mit
9
+ tags:
10
+ - reinforcement-learning
11
+ - simulation
12
+ - openenv
13
+ - wildfire
14
+ - rl-environment
15
+ - long-horizon
16
+ - instruction-following
17
+ ---
18
+
19
+ # Wildfire Containment Simulator
20
+
21
+ **Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**
22
+
23
+ ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
24
+ ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
25
+ ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
26
+ ![Python](https://img.shields.io/badge/Python-3.11+-blue)
27
+ ![License](https://img.shields.io/badge/License-MIT-green)
28
+
29
+ A partially observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
30
+
31
+ > **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on Medium tier, vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining **99%+ JSON success rate** throughout.
32
+ > *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*
33
+
34
+ ---
35
+
36
+ ## 🔗 Quick Links
37
+
38
+ | Resource | Link |
39
+ |---|---|
40
+ | 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
41
+ | 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
42
+ | 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
43
+ | 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
44
+ | 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
45
+ | 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
46
+ | 📈 **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
47
+ | 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
48
+ | 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
49
+
50
+ ---
51
+
52
+ ## Why Theme 2
53
+
54
+ | Pillar | How we model it |
55
+ |---|---|
56
+ | **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
57
+ | **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
58
+ | **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
59
+
60
+ ---
61
+
62
+ ## Real-World Motivation
63
+
64
+ Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.
65
+
66
+ For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
67
+
68
+ ---
69
+
70
+ ## Quickstart
71
+
72
+ ```bash
73
+ # Clone and install
74
+ git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
75
+ cd Wildfire-Containment-Simulator
76
+ uv pip install -r requirements.txt
77
+ uv pip install -e .
78
+
79
+ # Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
80
+ python scripts/evaluate.py 5
81
+
82
+ # Compare agents head-to-head
83
+ python scripts/eval_compare.py --seeds 42 43 44 45 46 \
84
+ --tiers easy medium hard --agents random heuristic
85
+
86
+ # Render an episode as a GIF
87
+ python scripts/replay.py --tier medium --seed 42 \
88
+ --agent heuristic --output demos/replay.gif
89
+
90
+ # Spin up the OpenEnv FastAPI server locally on port 7860
91
+ python server/app.py
92
+ # Then visit http://localhost:7860/ui/ for the interactive frontend
93
+ ```
94
+
95
+ Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
96
+
97
+ ---
98
+
99
+ ## Live Hugging Face Space
100
+
101
+ The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
102
+
103
+ ```bash
104
+ SPACE=https://eshit-wildfire-containment-simulator.hf.space
105
+
106
+ curl "$SPACE/health"
107
+ curl -X POST "$SPACE/reset?task_id=easy&seed=42"
108
+ curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
109
+ -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
110
+ ```
111
+
112
+ Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
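The same calls can be made from a script. The sketch below mirrors the curl commands above using only the Python standard library. The helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical, and it assumes the endpoints return JSON bodies, which the README does not state explicitly.

```python
import json
import urllib.request

# Space URL taken from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base: str = SPACE) -> urllib.request.Request:
    """Build POST /reset with task_id and seed as query parameters."""
    url = f"{base}/reset?task_id={task_id}&seed={seed}"
    return urllib.request.Request(url, data=b"", method="POST")


def build_step_request(action: dict, base: str = SPACE) -> urllib.request.Request:
    """Build POST /step with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A session would then look like `send(build_reset_request("easy", 42))` followed by repeated `send(build_step_request({...}))` calls. Building the request separately from sending it keeps the payload easy to inspect without touching the network.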
113
+
114
+ ---
115
+
116
+ ## Environment API
117
+
118
+ ```python
119
+ from env import WildfireEnv, Action, ActionType, Direction
120
+
121
+ env = WildfireEnv()
122
+ obs = env.reset(task_id="easy", seed=42) # Observation (with OperationalBriefing on first step)
123
+
124
+ while not env.done:
125
+     action = Action(
126
+         action_type=ActionType.DEPLOY_CREW,
127
+         crew_id="crew_0",
128
+         target_row=7, target_col=7,
129
+     )
130
+     result = env.step(action)  # StepResult
131
+     obs = result.observation
132
+     reward = result.reward  # decomposed float, range ~−8 to +8
133
+     done = result.done
134
+
135
+ state = env.state() # Full ground truth (grading only)
136
+ ```
137
+
138
+ `reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders; agents must work from `Observation`.
139
+
140
+ ---
141
+
142
+ ## Action Space
143
+
144
+ All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
145
+
146
+ | Action | Required parameters | Description |
147
+ |---|---|---|
148
+ | `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
149
+ | `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
150
+ | `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
151
+ | `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
152
+ | `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
153
+ | `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
154
+ | `idle` | `reason` *(optional)* | Explicitly wait |
155
+
156
+ A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
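The three layers could look roughly like the sketch below. It is an illustration of the fallback idea, not the contents of `env/action_parser.py`: the function name `parse_action`, the regular expressions, and the plain-dict return value (instead of the Pydantic `Action`) are all assumptions.

```python
import json
import re

# Action names taken from the action-space table; everything else is hypothetical.
KNOWN_ACTIONS = {
    "deploy_crew", "move_crew", "order_crew_objective",
    "drop_retardant", "build_firebreak", "recon_flight", "idle",
}


def parse_action(raw: str) -> dict:
    """Map raw model text to an action dict without ever raising."""
    # Layer 1: direct JSON, taken from the outermost {...} span if one exists.
    span = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if span:
        try:
            data = json.loads(span.group(0))
            if isinstance(data, dict) and data.get("action_type") in KNOWN_ACTIONS:
                return data
        except json.JSONDecodeError:
            pass  # fall through to regex extraction

    # Layer 2: regex field extraction from loosely formatted or truncated text.
    kind = re.search(r"action_type\W+(\w+)", raw)
    if kind and kind.group(1) in KNOWN_ACTIONS:
        action = {"action_type": kind.group(1)}
        for key in ("crew_id", "tanker_id", "direction", "objective"):
            m = re.search(rf"{key}\W+(\w+)", raw)
            if m:
                action[key] = m.group(1)
        for key in ("target_row", "target_col"):
            m = re.search(rf"{key}\W+(\d+)", raw)
            if m:
                action[key] = int(m.group(1))
        return action

    # Layer 3: safe fallback so the environment loop keeps running.
    return {"action_type": "idle", "reason": "unparseable model output"}
```

For example, a truncated completion such as `{"action_type": "move_crew", "crew_id": "crew_1", "direction": "N"` fails JSON parsing but is still recovered by the second layer.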
157
+
158
+ ---
159
+
160
+ ## Observation Space
161
+
162
+ | Component | Contents | Noise / occlusion |
163
+ |---|---|---|
164
+ | `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
165
+ | `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
166
+ | `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
167
+ | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
168
+ | `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
169
+ | `recent_events` | Last 5 notable events | Fully observable |
170
+
171
+ The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
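To make the clustering step concrete, here is a minimal sketch of BFS over burning cells that returns one bounding box per connected region. It is not the code in `env/serialization.py`: the function name `cluster_fire_regions`, the boolean-grid input, and the use of 8-connectivity are assumptions made for illustration.

```python
from collections import deque

Box = tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col)


def cluster_fire_regions(burning: list[list[bool]]) -> list[Box]:
    """Return one bounding box per 8-connected region of burning cells."""
    if not burning:
        return []
    rows, cols = len(burning), len(burning[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes: list[Box] = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a BFS from this unvisited burning cell.
            seen[r][c] = True
            queue = deque([(r, c)])
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

The prompt can then describe each fire as a single box, so its length grows with the number of regions and not with the number of burning cells.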
172
+
173
+ ---
174
+
175
+ ## Reward Function
176
+
177
+ Decomposed for GRPO: a wide reward range produces meaningful advantages between rollout groups.
178
+
179
+ **Per-step (dense):**
180
+ ```
181
+ step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
182
+ ```
183
+
184
+ **Terminal (sparse, on episode end):**
185
+ ```
186
+ +5.0    if all populations safe
187
+ +0–2.0  efficiency bonus (faster containment ⇒ more)
188
+ +1.0    briefing-adherence bonus (all priority zones survived)
189
+ −3.0 · (pop_lost / total_pop)  if any population lost
190
+ −2.0    if any crew casualty
191
+ −0.01 × invalid_action_count   capped at −0.2
192
+ ```
193
+
194
+ Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
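The listed terms can be combined as in the sketch below. It illustrates the arithmetic only and is not the contents of `env/reward.py`: the function and argument names are hypothetical, the efficiency bonus is assumed to fall linearly with the fraction of the step budget used (the README gives only its 0 to 2.0 range), and the invalid-action penalty is read as a total capped at 0.2.

```python
def step_reward(d_containment: float, d_population_safety: float, redundant: bool) -> float:
    """Dense per-step term, using the weights from the formula above."""
    return 0.4 * d_containment + 0.4 * d_population_safety - 0.1 * (1.0 if redundant else 0.0)


def terminal_reward(
    pop_lost: int,
    total_pop: int,
    steps_used: int,
    max_steps: int,
    priority_zones_survived: bool,
    crew_casualty: bool,
    invalid_action_count: int,
) -> float:
    """Sparse end-of-episode term built from the listed components."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0  # all populations safe
    else:
        reward -= 3.0 * pop_lost / total_pop  # proportional population penalty
    # ASSUMPTION: linear efficiency bonus in [0, 2.0]; the real shape is not documented here.
    reward += 2.0 * max(0.0, 1.0 - steps_used / max_steps)
    if priority_zones_survived:
        reward += 1.0  # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.2, 0.01 * invalid_action_count)  # penalty capped at 0.2
    return reward
```

Under these assumptions, an Easy episode that keeps everyone safe, finishes at step 40 of 80, protects the priority zones, and issues no invalid actions earns 5.0 + 1.0 + 1.0 = 7.0 at termination.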
195
+
196
+ | Tier | Spread scale | Episode length | Approx. reward ceiling |
197
+ |---|---|---|---|
198
+ | Easy | 1.00× | 80 | +8 |
199
+ | Medium | 0.70× | 150 | +7 |
200
+ | Hard | 0.55× | 300 | +6 |
201
+
202
+ ---
203
+
204
+ ## Three Difficulty Tiers
205
+
206
+ ### Task 1 (Easy): Flatland Grass Fire
207
+ 15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
208
+
209
+ ### Task 2 (Medium): Canyon Terrain with Wind Shifts
210
+ 25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
211
+
212
+ ### Task 3 (Hard): Wildland-Urban Interface Crisis
213
+ 40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
214
+
215
+ ---
216
+
217
+ ## Fire Spread Model
218
+
219
+ A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
220
+
221
+ ```
222
+ P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
223
+             × (1 − moisture) × (1 − suppression) × tier_scale
224
+ ```
225
+
226
+ | Factor | Effect |
227
+ |---|---|
228
+ | `base_rate` | Baseline spread by fuel type |
229
+ | `fuel_factor` | Fuel load of the target cell |
230
+ | `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
231
+ | `slope_factor` | Faster uphill, slower downhill |
232
+ | `moisture` | Wet ground / recent rain reduces ignition probability |
233
+ | `suppression` | Crew presence and retardant coverage reduce spread |
234
+ | `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
235
+
236
+ Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
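As a worked instance of the ignition formula, the sketch below multiplies the factors and draws one ignition attempt. It is not the code in `env/fire_spread.py`: the function names are hypothetical, clamping the product to [0, 1] is an assumption (the factors can multiply to more than 1), and it uses the standard-library `random.Random` for self-containment where the project threads `np.random.default_rng(seed)`.

```python
import random

# Tier scales from the factor table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(
    base_rate: float,
    fuel_factor: float,
    wind_factor: float,
    slope_factor: float,
    moisture: float,
    suppression: float,
    tier: str,
) -> float:
    """Product of the spread factors, clamped to a valid probability (assumed)."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(1.0, max(0.0, p))


def attempt_ignition(p: float, rng: random.Random) -> bool:
    """One Bernoulli draw: does a burning cell ignite this neighbor on this tick?"""
    return rng.random() < p
```

For instance, `ignition_probability(0.3, 1.0, 1.5, 1.2, 0.2, 0.5, "hard")` evaluates to 0.3 × 1.0 × 1.5 × 1.2 × 0.8 × 0.5 × 0.55 = 0.1188, so a downwind, uphill neighbor with a crew nearby still has roughly a 12% chance of igniting per tick under these made-up inputs.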
237
+
238
+ ---
239
+
240
+ ## Results
241
+
242
+ > Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers come from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
243
+
244
+ | Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
245
+ |---|---|---|---|
246
+ | Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
247
+ | Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
248
+ | **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
249
+ | **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |
250
+
251
+ The medium tier result passes the Β±1.0 of heuristic threshold (official passing criterion).
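
The pass/fail check can be reproduced from the table with a few lines of arithmetic. This is a minimal sketch that hard-codes the rounded means from the table; the dictionary and function names are made up for the example. Because the inputs are rounded, the computed deltas can differ from the table's Δ row by about 0.01.

```python
# Means copied from the results table (rounded to two decimals).
HEURISTIC = {"easy": 7.53, "medium": 6.31, "hard": 4.74}
TRAINED = {"easy": 5.13, "medium": 5.74, "hard": 2.14}


def delta_vs_heuristic(tier):
    """Trained mean minus heuristic mean for one tier."""
    return TRAINED[tier] - HEURISTIC[tier]


def passes(tier, tolerance=1.0):
    """True when the trained mean is within `tolerance` of the heuristic mean."""
    return abs(delta_vs_heuristic(tier)) <= tolerance


for tier in HEURISTIC:
    print(f"{tier}: delta={delta_vs_heuristic(tier):+.2f} pass={passes(tier)}")
```

Only the medium tier satisfies the criterion; easy and hard trail the heuristic by more than 1.0.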
+
+ **Auxiliary metrics for the trained agent:**
+
+ | Metric | Easy | Medium | Hard |
+ |---|---|---|---|
+ | JSON success rate | 98.5% | 99.8% | 99.2% |
+ | Mean population saved | 87% | 97% | 92% |
+
+ **Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached the hard tier after 63 of 150 training steps.
+
+ > Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).
+
+ ---
+
+ ## Training
+
+ We use a two-stage recipe:
+
+ 1. **SFT warm-up:** generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
+ 2. **GRPO (TRL `GRPOTrainer`):** start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
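
The replay-based scoring in step 2 can be sketched with a toy environment. Everything below is hypothetical: `ToyFireEnv`, `heuristic_policy`, and `score_completion` are stand-ins invented for this example, not the repo's classes. The point is only the contract: all randomness derives from `(tier, seed)`, so each candidate action is scored against the same episode that produced its prompt.

```python
import random


class ToyFireEnv:
    """Hypothetical stand-in for the real env; only the seeding contract matters."""

    def reset(self, tier, seed):
        # All randomness derives from (tier, seed), so resets are repeatable.
        self.rng = random.Random(f"{tier}:{seed}")
        self.fire = self.rng.randint(4, 8)  # number of burning cells
        self.t = 0
        return self.fire

    def step(self, action):
        growth = self.rng.randint(0, 1)  # drawn every step, whatever the action
        if action == "suppress":
            self.fire = max(0, self.fire - 2)
        if self.fire > 0:
            self.fire += growth
        self.t += 1
        done = self.fire == 0 or self.t >= 20
        # Dense penalty per burning cell plus a sparse bonus once the fire is out.
        reward = -0.1 * self.fire + (5.0 if self.fire == 0 else 0.0)
        return self.fire, reward, done


def heuristic_policy(fire):
    return "suppress" if fire > 0 else "idle"


def score_completion(tier, seed, candidate_action):
    """Reset to the prompt's episode, apply the candidate, let the heuristic finish."""
    env = ToyFireEnv()
    env.reset(tier, seed)
    fire, reward, done = env.step(candidate_action)
    total = reward
    while not done:
        fire, reward, done = env.step(heuristic_policy(fire))
        total += reward
    return total
```

Because two candidates for the same prompt see identical random draws, any difference in their scores comes from the action itself rather than from episode-to-episode noise, which is what gives GRPO a usable within-group advantage signal.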
+
+ **Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
+
+ **Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.
+
+ **Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate with `python scripts/plot_grpo_training.py`).
+
+ For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
+
+ ---
+
+ ## Project Structure
+
+ ```text
+ Wildfire-Containment-Simulator/
+ ├── env/
+ │   ├── wildfire_env.py        # Main env: reset(), step(), state()
+ │   ├── models.py              # Pydantic action/observation/state models
+ │   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
+ │   ├── fire_spread.py         # Cellular automaton fire propagation
+ │   ├── weather.py             # Stochastic weather engine
+ │   ├── resources.py           # Crews, tankers, firebreaks, recon
+ │   ├── reward.py              # Decomposed step + terminal reward
+ │   ├── briefing.py            # OperationalBriefing generation
+ │   ├── serialization.py       # Observation → LLM prompt
+ │   ├── action_parser.py       # LLM output → Action (3-layer fallback)
+ │   ├── rendering.py           # Frame rendering for GIF replays
+ │   └── curriculum.py          # CurriculumController (auto-promote/demote)
+ ├── agents/
+ │   ├── random_agent.py
+ │   └── heuristic_agent.py
+ ├── graders/
+ │   ├── grader_easy.py         # → (total_reward, details_dict)
+ │   ├── grader_medium.py
+ │   └── grader_hard.py
+ ├── scripts/
+ │   ├── evaluate.py            # Baseline eval (random + heuristic)
+ │   ├── eval_compare.py        # Multi-agent comparison
+ │   ├── eval_trained_model.py  # Evaluate a trained adapter
+ │   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
+ │   ├── replay.py              # Render episode as GIF
+ │   ├── run_demo.py            # Pitch demo
+ │   └── plot_dashboard.py      # 4-panel training curves
+ ├── training/
+ │   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
+ │   ├── sft_colab.ipynb        # SFT warm-up notebook
+ │   ├── sft_data.jsonl         # 4,300 SFT examples
+ │   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
+ │   └── README.md
+ ├── server/
+ │   └── app.py                 # FastAPI on port 7860
+ ├── frontend/                  # Interactive HTML/JS frontend served at /ui/
+ ├── tests/                     # 41 pytest tests
+ ├── demos/                     # GIF/PNG demo assets
+ ├── openenv.yaml               # OpenEnv environment manifest
+ ├── Dockerfile                 # HF Space build
+ ├── BLOG.md                    # Long-form write-up
+ └── README.md                  # You are here
+ ```
330
+
331
+ ---
332
+
333
+ ## Architecture Decisions
334
+
335
+ 1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, βˆ’3 Γ— loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
336
+ 2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic β€” protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
337
+ 3. **Two-stage training (SFT β†’ GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
338
+ 4. **3-layer action parser.** JSON parse β†’ regex fallback β†’ safe-`idle`. The training loop never breaks on malformed model output.
339
+ 5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) β†’ "What broke").
340
+ 6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem β€” every run is byte-for-byte reproducible.
341
+ 7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent β€” TRL, vLLM, an OpenAI-compatible API client, a curl loop β€” can drive it.
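
The three-layer fallback from decision 4 can be sketched as follows. This is a minimal illustration, not the code in `env/action_parser.py`: the `"action"` key, the action names in the usage note, and the `parse_action` function name are assumptions made for the example.

```python
import json
import re

# Assumed shape of the safe default; the real Action model is Pydantic-typed.
IDLE_ACTION = {"action": "idle"}


def _as_action(obj):
    """Accept only JSON objects that carry an 'action' field."""
    return obj if isinstance(obj, dict) and "action" in obj else None


def parse_action(text):
    """Layer 1: strict JSON. Layer 2: regex-extracted JSON. Layer 3: idle."""
    # Layer 1: the whole completion is valid JSON.
    try:
        action = _as_action(json.loads(text))
        if action is not None:
            return action
    except json.JSONDecodeError:
        pass

    # Layer 2: pull the outermost {...} span out of surrounding chatter.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            action = _as_action(json.loads(match.group(0)))
            if action is not None:
                return action
        except json.JSONDecodeError:
            pass

    # Layer 3: never raise; fall back to a harmless no-op.
    return dict(IDLE_ACTION)
```

For example, `parse_action('I will do {"action": "drop_retardant", "x": 3} now')` recovers the embedded object through the regex layer, while `parse_action("no json here")` returns the idle action, so a rollout continues instead of crashing.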
+
+ ---
+
+ ## Citation
+
+ If you use this environment, please cite:
+
+ ```bibtex
+ @misc{wildfire-containment-simulator-2026,
+   title  = {Wildfire Containment Simulator: Long-Horizon Planning and
+             Instruction Following for Disaster-Response LLM Agents},
+   author = {Team Wildfire},
+   year   = {2026},
+   url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
+   note   = {Meta OpenEnv Hackathon submission, Theme 2}
+ }
+ ```
+
+ ---
+
+ ## License
+
+ [MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.