---
title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - reinforcement-learning
  - simulation
  - openenv
  - wildfire
  - rl-environment
  - long-horizon
  - instruction-following
---

# Wildfire Containment Simulator

**Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following**

![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
![Python](https://img.shields.io/badge/Python-3.11+-blue)
![License](https://img.shields.io/badge/License-MIT-green)

A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.

> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on Medium tier — vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining **99%+ JSON success rate** throughout.
> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*

---

## 🔗 Quick Links

| Resource | Link |
|---|---|
| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
| 📈 **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |

---

## Why Theme 2

| Pillar | How we model it |
|---|---|
| **Long-horizon planning** | Hard tier runs 300 steps with the +5.0 terminal reward only triggered on full population survival — greedy local moves cannot capture it. |
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |

---

## Real-World Motivation

Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.

For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).

---

## Quickstart

```bash
# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend
```

Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).

---

## Live Hugging Face Space

The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:

```bash
SPACE=https://eshit-wildfire-containment-simulator.hf.space

curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
```

Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).

---

## Environment API

```python
from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)

while not env.done:
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)               # StepResult
    obs = result.observation
    reward = result.reward                  # decomposed float, range ~−8 to +8
    done = result.done

state = env.state()                          # Full ground truth (grading only)
```

`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders — agents must work from `Observation`.

---

## Action Space

All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**

| Action | Required parameters | Description |
|---|---|---|
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
| `idle` | `reason` *(optional)* | Explicitly wait |

A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**

---

## Observation Space

| Component | Contents | Noise / occlusion |
|---|---|---|
| `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, infrastructure, wind forecast | First step only |
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
| `recent_events` | Last 5 notable events | Fully observable |

The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.

---

## Reward Function

Decomposed for GRPO — wide reward range produces meaningful advantages between rollout groups.

**Per-step (dense):**
```
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
```

**Terminal (sparse, on episode end):**
```
+5.0   if all populations safe
+0–2.0 efficiency bonus (faster containment ⇒ more)
+1.0   briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0   if any crew casualty
−0.01 × invalid_action_count    capped at −0.2
```

Total empirical range: **−8 to +8**, declared in `openenv.yaml`.

| Tier | Spread scale | Episode length | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |

---

## Three Difficulty Tiers

### Task 1 — Easy: Flatland Grass Fire
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.

### Task 2 — Medium: Canyon Terrain with Wind Shifts
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.

### Task 3 — Hard: Wildland-Urban Interface Crisis
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.

---

## Fire Spread Model

A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:

```
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale
```

| Factor | Effect |
|---|---|
| `base_rate` | Baseline spread by fuel type |
| `fuel_factor` | Fuel load of the target cell |
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
| `slope_factor` | Faster uphill, slower downhill |
| `moisture` | Wet ground / recent rain reduces ignition probability |
| `suppression` | Crew presence and retardant coverage reduce spread |
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |

Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.

---

## Results

> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).

| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |

The medium tier result passes the ±1.0 of heuristic threshold (official passing criterion).

**Auxiliary metrics for the trained agent:**

| Metric | Easy | Medium | Hard |
|---|---|---|---|
| JSON success rate | 98.5% | 99.8% | 99.2% |
| Mean population saved % | 87% | 97% | 92% |

**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.

> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).

---

## Training

We use a two-stage recipe:

1. **SFT warm-up** — generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
2. **GRPO (TRL `GRPOTrainer`)** — start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).

**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.

**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git — generate with `python scripts/plot_grpo_training.py`).

For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).

---

## Project Structure

```text
Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py       # Main env: reset(), step(), state()
│   ├── models.py             # Pydantic action/observation/state models
│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py        # Cellular automaton fire propagation
│   ├── weather.py            # Stochastic weather engine
│   ├── resources.py          # Crews, tankers, firebreaks, recon
│   ├── reward.py             # Decomposed step + terminal reward
│   ├── briefing.py           # OperationalBriefing generation
│   ├── serialization.py      # Observation → LLM prompt
│   ├── action_parser.py      # LLM output → Action (3-layer fallback)
│   ├── rendering.py          # Frame rendering for GIF replays
│   └── curriculum.py         # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py        # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py           # Baseline eval (random + heuristic)
│   ├── eval_compare.py       # Multi-agent comparison
│   ├── eval_trained_model.py # Evaluate a trained adapter
│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
│   ├── replay.py             # Render episode as GIF
│   ├── run_demo.py           # Pitch demo
│   └── plot_dashboard.py     # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
│   ├── sft_colab.ipynb       # SFT warm-up notebook
│   ├── sft_data.jsonl        # 4,300 SFT examples
│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                # FastAPI on port 7860
├── frontend/                 # Interactive HTML/JS frontend served at /ui/
├── tests/                    # 41 pytest tests
├── demos/                    # GIF/PNG demo assets
├── openenv.yaml              # OpenEnv environment manifest
├── Dockerfile                # HF Space build
├── BLOG.md                   # Long-form write-up
└── README.md                 # You are here
```

---

## Architecture Decisions

1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic — protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem — every run is byte-for-byte reproducible.
7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.

---

## Citation

If you use this environment, please cite:

```bibtex
@misc{wildfire-containment-simulator-2026,
  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
            Instruction Following for Disaster-Response LLM Agents},
  author = {Team Wildfire},
  year   = {2026},
  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
  note   = {Meta OpenEnv Hackathon submission, Theme 2}
}
```

---

## License

[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.