---

title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - reinforcement-learning
  - simulation
  - openenv
  - wildfire
  - rl-environment
  - long-horizon
  - instruction-following
---


# Wildfire Containment Simulator

**Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**

![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
![Python](https://img.shields.io/badge/Python-3.11+-blue)
![License](https://img.shields.io/badge/License-MIT-green)

A partially observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.

> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on the Medium tier, vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, and its JSON success rate at evaluation is **98.5% or higher** on every tier.
> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*

---

## 🔗 Quick Links

| Resource | Link |
|---|---|
| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
| 📈 **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |

---

## Why Theme 2

| Pillar | How we model it |
|---|---|
| **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |

---

## Real-World Motivation

Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.

For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).

---

## Quickstart

```bash
# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend
```

Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).

---

## Live Hugging Face Space

The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:

```bash
SPACE=https://eshit-wildfire-containment-simulator.hf.space

curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
```

Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
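
The same calls can be issued from Python. The sketch below uses only the standard library and assumes the endpoints accept the query parameters and JSON body shown in the curl example. The helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical and not part of this repo, and nothing is assumed about the response schema beyond it being JSON.

```python
import json
import urllib.parse
import urllib.request

SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(base_url: str, task_id: str, seed: int) -> urllib.request.Request:
    """POST /reset with task_id and seed as query parameters (mirrors the curl call)."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", data=b"", method="POST")


def build_step_request(base_url: str, action: dict) -> urllib.request.Request:
    """POST /step with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the JSON response (schema not assumed here)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical session would call `send(build_reset_request(SPACE, "easy", 42))` once, then `send(build_step_request(SPACE, action))` in a loop with an action dict shaped like the curl payload. Keeping request construction separate from sending makes the payloads easy to inspect without touching the network.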

---

## Environment API

```python
from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)

while not env.done:
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)               # StepResult
    obs = result.observation
    reward = result.reward                  # decomposed float, range ~−8 to +8
    done = result.done

state = env.state()                          # Full ground truth (grading only)
```

`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders; agents must work from `Observation`.

---

## Action Space

All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**

| Action | Required parameters | Description |
|---|---|---|
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
| `idle` | `reason` *(optional)* | Explicitly wait |

A 3-layer parser (`env/action_parser.py`) maps raw LLM output to a structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
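
The three layers can be illustrated with a minimal sketch. This is not the code in `env/action_parser.py`: it works on plain dicts rather than the Pydantic `Action` model, the regexes are illustrative, and the `parse_failure` reason string is a made-up placeholder. Only the action and field names come from the table above.

```python
import json
import re

# Action names taken from the action table; everything else is illustrative.
VALID_ACTIONS = {
    "deploy_crew", "move_crew", "order_crew_objective",
    "drop_retardant", "build_firebreak", "recon_flight", "idle",
}
IDLE = {"action_type": "idle", "reason": "parse_failure"}  # hypothetical reason text


def parse_action(raw: str) -> dict:
    """Map raw model text to an action dict without raising on bad input."""
    # Layer 1: the outermost {...} span parses as JSON with a known action_type.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            candidate = json.loads(match.group(0))
            if isinstance(candidate, dict) and candidate.get("action_type") in VALID_ACTIONS:
                return candidate
        except json.JSONDecodeError:
            pass

    # Layer 2: pull individual fields out of malformed text with regexes.
    kind = re.search(r"action_type\W+(\w+)", raw)
    if kind and kind.group(1) in VALID_ACTIONS:
        action = {"action_type": kind.group(1)}
        for key in ("crew_id", "tanker_id", "direction", "objective"):
            found = re.search(rf"{key}\W+(\w+)", raw)
            if found:
                action[key] = found.group(1)
        for key in ("target_row", "target_col"):
            found = re.search(rf"{key}\W+(\d+)", raw)
            if found:
                action[key] = int(found.group(1))
        return action

    # Layer 3: safe fallback so the episode loop keeps running.
    return dict(IDLE)
```

In this sketch a truncated JSON object (for example, one missing its closing brace) falls through to layer 2, and free-form text with no recognizable action falls through to `idle`. Whether required parameters are present is left to downstream validation.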

---

## Observation Space

| Component | Contents | Noise / occlusion |
|---|---|---|
| `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
| `recent_events` | Last 5 notable events | Fully observable |

The observation is rendered into LLM-friendly text via `serialize_observation()` (`env/serialization.py`), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
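
The clustering idea can be sketched as follows. This is an illustration of BFS region grouping, not the code in `env/serialization.py`: the function name `fire_bounding_boxes`, the boolean-grid input, and the choice of 8-connectivity are all assumptions made for the example.

```python
from collections import deque


def fire_bounding_boxes(burning: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Group burning cells into 8-connected regions via BFS.

    Returns one (min_row, min_col, max_row, max_col) box per region,
    in row-major order of each region's first cell.
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a new region and flood outwards from (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

Each cell is visited once, so the pass is linear in grid size, while the text handed to the model grows only with the number of boxes returned.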

---

## Reward Function

Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.

**Per-step (dense):**
```
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
```

**Terminal (sparse, on episode end):**
```
+5.0   if all populations safe
+0–2.0 efficiency bonus (faster containment ⇒ more)
+1.0   briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0   if any crew casualty
−0.01 × invalid_action_count    capped at −0.2
```

Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
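
As a rough illustration, the two formulas above translate into the functions below. This is a reading of the listed components, not the code in `env/reward.py`. The function and argument names are hypothetical, and because the efficiency-bonus formula is not given here, it is taken as an input and only clipped to the documented 0 to 2.0 range.

```python
def step_reward(d_containment: float, d_population_safety: float, redundant: bool) -> float:
    """Dense per-step term: 0.4*d_containment + 0.4*d_population_safety - 0.1*redundant."""
    return 0.4 * d_containment + 0.4 * d_population_safety - 0.1 * float(redundant)


def terminal_reward(
    pop_lost: int,
    total_pop: int,
    efficiency_bonus: float,
    priority_zones_survived: bool,
    crew_casualty: bool,
    invalid_action_count: int,
) -> float:
    """Sparse end-of-episode term assembled from the listed components."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                               # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)      # proportional loss penalty
    # Assumption: the caller computes the bonus; only the range is enforced here.
    reward += min(max(efficiency_bonus, 0.0), 2.0)
    if priority_zones_survived:
        reward += 1.0                               # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # capped invalid-action penalty
    return reward
```

Under this reading, a perfect episode earns 5.0 + 2.0 + 1.0 = 8.0 from the terminal components alone.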

| Tier | Spread scale | Episode length | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |

---

## Three Difficulty Tiers

### Task 1 (Easy): Flatland Grass Fire
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.

### Task 2 (Medium): Canyon Terrain with Wind Shifts
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.

### Task 3 (Hard): Wildland-Urban Interface Crisis
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.

---

## Fire Spread Model

A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:

```
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale
```

| Factor | Effect |
|---|---|
| `base_rate` | Baseline spread by fuel type |
| `fuel_factor` | Fuel load of the target cell |
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
| `slope_factor` | Faster uphill, slower downhill |
| `moisture` | Wet ground / recent rain reduces ignition probability |
| `suppression` | Crew presence and retardant coverage reduce spread |
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |

Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
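
The ignition formula can be written out directly. The sketch below is illustrative and is not taken from `env/fire_spread.py`: the function names are hypothetical, the factors are passed in as plain numbers, and the clamp to [0, 1] is an added assumption for the case where wind or slope multipliers push the product above 1.

```python
import random

# Tier scales as listed in the factor table.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(
    base_rate: float,
    fuel_factor: float,
    wind_factor: float,
    slope_factor: float,
    moisture: float,
    suppression: float,
    tier: str,
) -> float:
    """Product of the spread factors for one burning-cell/neighbor pair."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    # Assumption: clamp so the result is always a valid probability.
    return min(max(p, 0.0), 1.0)


def attempts_ignition(p: float, rng: random.Random) -> bool:
    """One Bernoulli draw per neighbor per tick, using a seeded generator."""
    return rng.random() < p
```

Because the terms multiply, full suppression or fully saturated ground (`suppression = 1` or `moisture = 1`) drives the probability to zero regardless of wind and slope.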

---

## Results

> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).

| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |

The medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.

**Auxiliary metrics for the trained agent:**

| Metric | Easy | Medium | Hard |
|---|---|---|---|
| JSON success rate | 98.5% | 99.8% | 99.2% |
| Mean population saved % | 87% | 97% | 92% |

**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.

> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).

---

## Training

We use a two-stage recipe:

1. **SFT warm-up:** generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
2. **GRPO (TRL `GRPOTrainer`):** start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).

**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.

**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate with `python scripts/plot_grpo_training.py`).

For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).

---

## Project Structure

```text
Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py       # Main env: reset(), step(), state()
│   ├── models.py             # Pydantic action/observation/state models
│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py        # Cellular automaton fire propagation
│   ├── weather.py            # Stochastic weather engine
│   ├── resources.py          # Crews, tankers, firebreaks, recon
│   ├── reward.py             # Decomposed step + terminal reward
│   ├── briefing.py           # OperationalBriefing generation
│   ├── serialization.py      # Observation → LLM prompt
│   ├── action_parser.py      # LLM output → Action (3-layer fallback)
│   ├── rendering.py          # Frame rendering for GIF replays
│   └── curriculum.py         # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py        # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py           # Baseline eval (random + heuristic)
│   ├── eval_compare.py       # Multi-agent comparison
│   ├── eval_trained_model.py # Evaluate a trained adapter
│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
│   ├── replay.py             # Render episode as GIF
│   ├── run_demo.py           # Pitch demo
│   └── plot_dashboard.py     # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
│   ├── sft_colab.ipynb       # SFT warm-up notebook
│   ├── sft_data.jsonl        # 4,300 SFT examples
│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                # FastAPI on port 7860
├── frontend/                 # Interactive HTML/JS frontend served at /ui/
├── tests/                    # 41 pytest tests
├── demos/                    # GIF/PNG demo assets
├── openenv.yaml              # OpenEnv environment manifest
├── Dockerfile                # HF Space build
├── BLOG.md                   # Long-form write-up
└── README.md                 # You are here
```

---

## Architecture Decisions

1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
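
To make the first point concrete, here is a minimal sketch of the group-relative normalization that GRPO-style training applies to the rewards of completions sampled for the same prompt. It is a simplified illustration, not TRL's implementation: the function name is hypothetical, and the population standard deviation and the `eps` value are choices made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is why a reward with real spread between good and bad rollouts matters.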

---

## Citation

If you use this environment, please cite:

```bibtex
@misc{wildfire-containment-simulator-2026,
  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
            Instruction Following for Disaster-Response LLM Agents},
  author = {Team Wildfire},
  year   = {2026},
  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
  note   = {Meta OpenEnv Hackathon submission, Theme 2}
}
```

---

## License

[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.