---
title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
- reinforcement-learning
- simulation
- openenv
- wildfire
- rl-environment
- long-horizon
- instruction-following
---
# Wildfire Containment Simulator
**Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**





A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on the Medium tier, vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in 63 of 150 training steps, maintaining a **99%+ JSON success rate** throughout.
> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*
---
## Quick Links
| Resource | Link |
|---|---|
| **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
| **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
| **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
| **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
| **Long-form blog post** | [`BLOG.md`](BLOG.md) |
| **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
| **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
| **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
| **2-minute pitch video** | *(YouTube link coming soon)* |
---
## Why Theme 2
| Pillar | How we model it |
|---|---|
| **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward triggers only on full population survival, so greedy local moves cannot capture it. |
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
---
## Real-World Motivation
Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.
For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
---
## Quickstart
```bash
# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend
```
Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
---
## Live Hugging Face Space
The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
```bash
SPACE=https://eshit-wildfire-containment-simulator.hf.space
curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
```
Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
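
The same calls can be made from Python with only the standard library. The sketch below mirrors the curl commands above; it assumes the endpoints return JSON objects, and the helper names (`build_reset_request`, `build_step_request`, `send`, `run_demo`) are illustrative rather than part of the project.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id, seed, base=SPACE):
    """Build POST /reset with task_id and seed as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base}/reset?{query}", data=b"", method="POST")


def build_step_request(action, base=SPACE):
    """Build POST /step with the action dict as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send a request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_demo():
    """Reset the easy tier, then deploy one crew (needs network access)."""
    send(build_reset_request("easy", 42))
    action = {"action_type": "deploy_crew", "crew_id": "crew_0",
              "target_row": 7, "target_col": 7}
    return send(build_step_request(action))
```

Calling `run_demo()` performs the reset and one step against the live Space; the two `build_*` helpers can be inspected offline.
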
---
## Environment API
```python
from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)  # Observation (with OperationalBriefing on first step)

while not env.done:
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)  # StepResult
    obs = result.observation
    reward = result.reward     # decomposed float, range ~-8 to +8
    done = result.done

state = env.state()  # Full ground truth (grading only)
```
`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders β agents must work from `Observation`.
---
## Action Space
All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
| Action | Required parameters | Description |
|---|---|---|
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
| `idle` | `reason` *(optional)* | Explicitly wait |
A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
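
To make the fallback chain concrete, here is a minimal sketch of the three layers. It is not the code in `env/action_parser.py`: it returns plain dicts instead of the Pydantic `Action`, and the regex patterns and the `parse_action` name are assumptions chosen for illustration.

```python
import json
import re

# Field names taken from the action table above.
STR_FIELDS = ("action_type", "crew_id", "tanker_id", "direction", "objective")
INT_FIELDS = ("target_row", "target_col")
IDLE = {"action_type": "idle", "reason": "unparseable model output"}

_Q = "[\"']?"  # optional single or double quote


def _try_json(text):
    """Layer 1: parse the outermost {...} span as JSON."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and "action_type" in obj else None


def _try_regex(text):
    """Layer 2: pull individual fields out of malformed JSON-like text."""
    found = {}
    for name in STR_FIELDS:
        match = re.search(_Q + name + _Q + r"\s*[:=]\s*" + _Q + r"([A-Za-z0-9_*]+)", text)
        if match:
            found[name] = match.group(1)
    for name in INT_FIELDS:
        match = re.search(_Q + name + _Q + r"\s*[:=]\s*(-?\d+)", text)
        if match:
            found[name] = int(match.group(1))
    return found if "action_type" in found else None


def parse_action(text):
    """Layer 3: fall back to a safe idle action if both parsers fail."""
    return _try_json(text) or _try_regex(text) or dict(IDLE)
```

In this sketch a truncated completion still yields whatever fields the regex layer can recover, and validation of the resulting dict is left to the Pydantic model.
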
---
## Observation Space
| Component | Contents | Noise / occlusion |
|---|---|---|
| `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
| `recent_events` | Last 5 notable events | Fully observable |
The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
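
The clustering idea can be sketched as follows. This is an illustration, not the code in `env/serialization.py`: it takes a plain 2D grid of booleans, assumes 8-connectivity (the source does not state which connectivity is used), and the function name `cluster_fire_regions` is hypothetical.

```python
from collections import deque


def cluster_fire_regions(burning):
    """Group burning cells into connected regions via BFS.

    burning: 2D list of truthy/falsy values, one per grid cell.
    Returns one dict per region with its bounding box and cell count.
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a new region and flood outward from (r, c).
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            size = 0
            while queue:
                cr, cc = queue.popleft()
                size += 1
                r0, r1 = min(r0, cr), max(r1, cr)
                c0, c1 = min(c0, cc), max(c1, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            regions.append({"rows": (r0, r1), "cols": (c0, c1), "cells": size})
    return regions
```

Each region then costs one line of prompt text regardless of how many cells it covers, which is where the `O(regions)` prompt size comes from.
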
---
## Reward Function
Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.
**Per-step (dense):**
```
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
```
**Terminal (sparse, on episode end):**
```
+5.0                            if all populations safe
+0–2.0                          efficiency bonus (faster containment → more)
+1.0                            briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0                            if any crew casualty
−0.01 × invalid_action_count    capped at −0.2
```
Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
| Tier | Spread scale | Episode length | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |
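
The per-step and terminal formulas above can be written out as a short sketch. This is not the code in `env/reward.py`: the function and parameter names are illustrative, and because the source does not say how the 0–2.0 efficiency bonus is computed, it is passed in as an already-computed value.

```python
def step_reward(d_containment, d_population_safety, redundant_action):
    """Dense per-step reward from the two deltas and the redundancy flag."""
    penalty = 0.1 if redundant_action else 0.0
    return 0.4 * d_containment + 0.4 * d_population_safety - penalty


def terminal_reward(pop_lost, total_pop, efficiency_bonus,
                    priority_zones_survived, crew_casualty,
                    invalid_action_count):
    """Sparse terminal reward, following the list of terms above.

    efficiency_bonus is assumed to be precomputed and within [0, 2].
    """
    if not 0.0 <= efficiency_bonus <= 2.0:
        raise ValueError("efficiency_bonus must be within [0, 2]")
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                              # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)     # proportional loss penalty
    reward += efficiency_bonus
    if priority_zones_survived:
        reward += 1.0                              # briefing adherence
    if crew_casualty:
        reward -= 2.0
    reward += max(-0.2, -0.01 * invalid_action_count)  # capped at -0.2
    return reward
```

For example, a flawless episode (no population lost, full efficiency bonus, priority zones intact, no casualty, no invalid actions) gives 5.0 + 2.0 + 1.0 = 8.0 in terminal reward under this sketch.
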
---
## Three Difficulty Tiers
### Task 1 (Easy): Flatland Grass Fire
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
### Task 2 (Medium): Canyon Terrain with Wind Shifts
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
### Task 3 (Hard): Wildland-Urban Interface Crisis
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
---
## Fire Spread Model
A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
```
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale
```
| Factor | Effect |
|---|---|
| `base_rate` | Baseline spread by fuel type |
| `fuel_factor` | Fuel load of the target cell |
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
| `slope_factor` | Faster uphill, slower downhill |
| `moisture` | Wet ground / recent rain reduces ignition probability |
| `suppression` | Crew presence and retardant coverage reduce spread |
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
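
As a worked illustration of the ignition formula, the sketch below multiplies the factors and draws one Bernoulli sample per neighbor. It is not the code in `env/fire_spread.py`: the function names are hypothetical, the input values in the usage note are made up, and clamping the product to `[0, 1]` is an assumption added so the result is always a valid probability.

```python
import random

# Tier scales from the factor table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(base_rate, fuel_factor, wind_factor, slope_factor,
                         moisture, suppression, tier):
    """Probability that one burning cell ignites one unburned neighbor."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    # Clamp (an assumption): wind and slope boosts can push the product above 1.
    return min(1.0, max(0.0, p))


def try_ignite(probability, rng):
    """Draw one ignition attempt from a seeded random generator."""
    return rng.random() < probability
```

With illustrative inputs `base_rate=0.3`, `fuel_factor=1.0`, `wind_factor=1.5`, `slope_factor=1.2`, `moisture=0.2`, `suppression=0.5` on the medium tier, the product is 0.3 × 1.0 × 1.5 × 1.2 × 0.8 × 0.5 × 0.7 ≈ 0.151. Full suppression (`suppression=1.0`) drives the probability to zero.
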
---
## Results
> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |
The Medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.
**Auxiliary metrics for the trained agent:**
| Metric | Easy | Medium | Hard |
|---|---|---|---|
| JSON success rate | 98.5% | 99.8% | 99.2% |
| Mean population saved % | 87% | 97% | 92% |
**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.
> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).
---
## Training
We use a two-stage recipe:
1. **SFT warm-up:** generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
2. **GRPO (TRL `GRPOTrainer`):** start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
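
The outcome-scoring step in stage 2 can be sketched as below. This is a self-contained illustration rather than the notebook's `reward_fn_outcome`: `ToyEnv`, `StepResult`, `toy_heuristic`, and `score_completion` are stand-ins invented here so the example runs without the real environment, and the toy reward values carry no meaning.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    reward: float
    done: bool


class ToyEnv:
    """Hypothetical stand-in for the real env; deterministic per (tier, seed)."""

    def reset(self, task_id, seed):
        self._rng = random.Random(f"{task_id}:{seed}")
        self._steps_left = 5
        self.done = False

    def step(self, action):
        noise = self._rng.uniform(-0.1, 0.1)
        base = 0.0 if action.get("action_type") == "idle" else 0.5
        self._steps_left -= 1
        self.done = self._steps_left == 0
        return StepResult(reward=base + noise, done=self.done)


def toy_heuristic(env):
    """Stand-in rollout policy; the real recipe uses the heuristic agent."""
    return {"action_type": "deploy_crew"}


def score_completion(env, tier, seed, candidate_action, rollout_policy):
    """Reset to the prompt's (tier, seed), apply the candidate action,
    then let the rollout policy play to the end and sum the rewards."""
    env.reset(task_id=tier, seed=seed)
    total = env.step(candidate_action).reward
    while not env.done:
        total += env.step(rollout_policy(env)).reward
    return total
```

Because the env is reset to the same `(tier, seed)` for every completion of a prompt, differences in score within a rollout group come from the candidate action rather than from a different starting state.
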
**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.
**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate with `python scripts/plot_grpo_training.py`).
For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
---
## Project Structure
```text
Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py        # Main env: reset(), step(), state()
│   ├── models.py              # Pydantic action/observation/state models
│   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py         # Cellular automaton fire propagation
│   ├── weather.py             # Stochastic weather engine
│   ├── resources.py           # Crews, tankers, firebreaks, recon
│   ├── reward.py              # Decomposed step + terminal reward
│   ├── briefing.py            # OperationalBriefing generation
│   ├── serialization.py       # Observation → LLM prompt
│   ├── action_parser.py       # LLM output → Action (3-layer fallback)
│   ├── rendering.py           # Frame rendering for GIF replays
│   └── curriculum.py          # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py         # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py            # Baseline eval (random + heuristic)
│   ├── eval_compare.py        # Multi-agent comparison
│   ├── eval_trained_model.py  # Evaluate a trained adapter
│   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
│   ├── replay.py              # Render episode as GIF
│   ├── run_demo.py            # Pitch demo
│   └── plot_dashboard.py      # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
│   ├── sft_colab.ipynb        # SFT warm-up notebook
│   ├── sft_data.jsonl         # 4,300 SFT examples
│   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                 # FastAPI on port 7860
├── frontend/                  # Interactive HTML/JS frontend served at /ui/
├── tests/                     # 41 pytest tests
├── demos/                     # GIF/PNG demo assets
├── openenv.yaml               # OpenEnv environment manifest
├── Dockerfile                 # HF Space build
├── BLOG.md                    # Long-form write-up
└── README.md                  # You are here
```
---
## Architecture Decisions
1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see the "What broke" section of [`BLOG.md`](BLOG.md)).
6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
---
## Citation
If you use this environment, please cite:
```bibtex
@misc{wildfire-containment-simulator-2026,
title = {Wildfire Containment Simulator: Long-Horizon Planning and
Instruction Following for Disaster-Response LLM Agents},
author = {Team Wildfire},
year = {2026},
url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
note = {Meta OpenEnv Hackathon submission, Theme 2}
}
```
---
## License
[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.