title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
- reinforcement-learning
- simulation
- openenv
- wildfire
- rl-environment
- long-horizon
- instruction-following
Wildfire Containment Simulator
Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following
A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
Headline result (post-training run, Apr 26): Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of +5.74 on the Medium tier, vs. +6.31 for the rule-based heuristic and +1.31 for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining a 99%+ JSON success rate throughout. (Full comparison table in Results. Model: Eshit/wildfire-grpo-7b. W&B run: wildfire-grpo/runs/dnz56kuu.)
Quick Links
| Resource | Link |
|---|---|
| Live HF Space (env) | huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator |
| GitHub source | github.com/Abrodolph/Wildfire-Containment-Simulator |
| GRPO training notebook | training/grpo_v2_colab.ipynb |
| SFT warm-up notebook | training/sft_colab.ipynb |
| Long-form blog post | BLOG.md |
| Baseline eval JSON | scripts/results.json |
| Training dashboard | W&B run: wildfire-grpo/runs/dnz56kuu |
| Heuristic replay GIF | demos/heuristic_replay.gif |
| 2-minute pitch video | (YouTube link coming soon) |
Why Theme 2
| Pillar | How we model it |
|---|---|
| Long-horizon planning | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
| Instruction following | Every episode opens with an OperationalBriefing (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| Recovery from failure | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
Real-World Motivation
Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.
For the deeper story behind the design choices, see BLOG.md.
Quickstart
# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .
# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5
# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
--tiers easy medium hard --agents random heuristic
# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
--agent heuristic --output demos/replay.gif
# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend
Full test suite: pytest tests -v (41 tests, ~30s on CPU).
Live Hugging Face Space
The environment is deployed at Eshit/Wildfire-Containment-Simulator on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
SPACE=https://eshit-wildfire-containment-simulator.hf.space
curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
-d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
Endpoints: /reset, /step, /state, /state/render, /auto_step, /health, /docs (Swagger UI), /ui/ (interactive frontend).
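The same calls can be issued from Python using only the standard library. The following is a minimal sketch, not part of the repository: the helper names `build_reset`, `build_step`, and `send` are hypothetical, and it assumes that the endpoints reply with JSON bodies, which the curl examples above do not confirm.

```python
import json
from urllib import parse, request

# Base URL copied from the curl example; the helpers below are illustrative.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset(task_id: str, seed: int) -> request.Request:
    """Build the POST /reset request with task_id and seed as query parameters."""
    query = parse.urlencode({"task_id": task_id, "seed": seed})
    return request.Request(f"{SPACE}/reset?{query}", data=b"", method="POST")


def build_step(action: dict) -> request.Request:
    """Build the POST /step request with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return request.Request(
        f"{SPACE}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the reply (assumption: the server returns JSON)."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A typical session would call `send(build_reset("easy", 42))` once and then `send(build_step({...}))` in a loop. Keeping request construction separate from sending makes the payloads easy to inspect without touching the network.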
Environment API
from env import WildfireEnv, Action, ActionType, Direction
env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)  # Observation (with OperationalBriefing on first step)
while not env.done:
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)  # StepResult
    obs = result.observation
    reward = result.reward  # decomposed float, range ~-8 to +8
    done = result.done
state = env.state()  # Full ground truth (grading only)
reset(task_id, seed) is fully deterministic. state() is intentionally exposed only for graders β agents must work from Observation.
Action Space
All actions are Pydantic-validated. Invalid actions return a penalty reward without crashing the environment.
| Action | Required parameters | Description |
|---|---|---|
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
| `move_crew` | `crew_id`, `direction` (N/S/E/W/NE/NW/SE/SW) | Move a deployed crew one cell |
| `order_crew_objective` | `crew_id`, `objective` (hold/advance/retreat/prioritize_*) | Set a persistent directive for a crew's local policy |
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
| `idle` | `reason` (optional) | Explicitly wait |
A 3-layer parser (env/action_parser.py) maps raw LLM output → structured Action: direct JSON → regex field extraction → safe-idle fallback. The environment loop never breaks on bad model output.
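The three layers can be sketched roughly as follows. This is an illustration of the fallback idea, not the code in env/action_parser.py: the function name `parse_action`, the regex, and the use of plain dicts instead of the Pydantic `Action` model are all assumptions.

```python
import json
import re

# Action names taken from the Action Space table.
VALID_ACTIONS = {
    "deploy_crew", "move_crew", "order_crew_objective", "drop_retardant",
    "build_firebreak", "recon_flight", "idle",
}

# Hypothetical pattern: matches key: value or "key": "value" pairs.
_FIELD_RE = re.compile(r'"?(\w+)"?\s*[:=]\s*"?([\w-]+)"?')


def parse_action(raw: str) -> dict:
    """Map raw model text to an action dict, never raising on bad input."""
    # Layer 1: the output is already valid JSON with a known action type.
    try:
        obj = json.loads(raw)
        if isinstance(obj, dict) and obj.get("action_type") in VALID_ACTIONS:
            return obj
    except json.JSONDecodeError:
        pass

    # Layer 2: pull key/value pairs out of free text or truncated JSON.
    fields = {}
    for key, value in _FIELD_RE.findall(raw):
        fields[key] = int(value) if re.fullmatch(r"-?\d+", value) else value
    if fields.get("action_type") in VALID_ACTIONS:
        return fields

    # Layer 3: nothing usable was found, so fall back to a safe idle.
    return {"action_type": "idle", "reason": "unparseable model output"}
```

For example, `parse_action('{"action_type": "move_crew", "crew_id": "crew_1", "direction": "NE"')` (note the missing closing brace) fails layer 1 but is recovered by layer 2, while plain chatter such as `"I think we should wait."` falls through to idle.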
Observation Space
| Component | Contents | Noise / occlusion |
|---|---|---|
| `briefing` | OperationalBriefing on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
| `grid` | 2D array of CellObservation (fire_state, intensity_bin, smoke_density, is_populated, crew_present) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| `weather` | wind_speed_kmh, wind_direction_deg, humidity_pct, rain_active | ±5 km/h, ±20° on medium/hard |
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| `stats` | cells_burned, cells_burning, population_lost, containment_pct, current_step | Fully observable |
| `recent_events` | Last 5 notable events | Fully observable |
The observation is rendered into LLM-friendly text via serialize_observation() (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is O(regions) instead of O(cells).
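To illustrate the clustering step, here is a minimal BFS sketch that groups burning cells into regions and reports one bounding box per region. It is not the code in env/serialization.py: the function name `cluster_fire_regions`, the boolean-grid input, the tuple layout, and the choice of 8-connectivity are assumptions made for this example.

```python
from collections import deque


def cluster_fire_regions(burning):
    """Group burning cells into 8-connected regions.

    burning: 2D list of truthy/falsy values, one per cell.
    Returns a list of (min_row, min_col, max_row, max_col, cell_count).
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []

    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a BFS from this unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if ((dr or dc) and 0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            regions.append((min_r, min_c, max_r, max_c, count))
    return regions
```

A prompt builder can then emit one line per tuple, which is what keeps the text proportional to the number of fire fronts rather than the number of cells.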
Reward Function
Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.
Per-step (dense):
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
Terminal (sparse, on episode end):
+5.0 if all populations safe
+0 to +2.0 efficiency bonus (faster containment earns more)
+1.0 briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop) if any population lost
−2.0 if any crew casualty
−0.01 × invalid_action_count, capped at −0.2
Total empirical range: −8 to +8, declared in openenv.yaml.
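The terms above combine as in the sketch below. It is a reading of the listed coefficients, not the code in env/reward.py: the function names and signatures are hypothetical, the efficiency bonus is taken as a precomputed input because its formula is not given here, and applying that bonus regardless of population loss is an assumption.

```python
def step_reward(d_containment: float, d_population_safety: float,
                redundant_action: bool) -> float:
    """Dense per-step term using the coefficients listed above."""
    return (0.4 * d_containment
            + 0.4 * d_population_safety
            - 0.1 * float(redundant_action))


def terminal_reward(pop_lost: int, total_pop: int, efficiency_bonus: float,
                    priority_zones_survived: bool, crew_casualty: bool,
                    invalid_action_count: int) -> float:
    """Sparse end-of-episode term; efficiency_bonus is supplied by the caller."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                              # all populations safe
    else:
        reward -= 3.0 * pop_lost / total_pop       # proportional loss penalty
    reward += min(max(efficiency_bonus, 0.0), 2.0)  # clamp to the stated 0..2 band
    if priority_zones_survived:
        reward += 1.0                              # briefing adherence
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # capped invalid-action penalty
    return reward
```

Under these assumptions a flawless episode earns 5.0 + 2.0 + 1.0 = 8.0 from the terminal term, which lines up with the stated upper end of the range.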
| Tier | Spread scale | Episode length | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |
Three Difficulty Tiers
Task 1 (Easy): Flatland Grass Fire
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. Focus: basic deployment and perimeter control.
Task 2 (Medium): Canyon Terrain with Wind Shifts
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. Focus: terrain-aware containment under multi-front pressure.
Task 3 (Hard): Wildland-Urban Interface Crisis
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. Focus: long-horizon planning under uncertainty and recovery from cascading failures.
Fire Spread Model
A Rothermel-inspired cellular automaton on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale
| Factor | Effect |
|---|---|
| `base_rate` | Baseline spread by fuel type |
| `fuel_factor` | Fuel load of the target cell |
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
| `slope_factor` | Faster uphill, slower downhill |
| `moisture` | Wet ground / recent rain reduces ignition probability |
| `suppression` | Crew presence and retardant coverage reduce spread |
| `tier_scale` | easy=1.00, medium=0.70, hard=0.55 |
Burning cells progress through BURNING → EMBER → BURNED_OUT. Urban cells have higher peak intensity but lower ignition probability.
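As a worked instance of the ignition formula, the sketch below multiplies the factors and draws one Bernoulli sample per neighbor. It is not the code in env/fire_spread.py: the function names are hypothetical, clamping the product to [0, 1] is an assumption, and the standard-library `random.Random` stands in for the project's NumPy generator to keep the example dependency-free.

```python
import random

# Tier multipliers as listed in the factor table.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(base_rate: float, fuel_factor: float,
                         wind_factor: float, slope_factor: float,
                         moisture: float, suppression: float,
                         tier: str) -> float:
    """Product of all factors, clamped to a valid probability (assumption)."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(max(p, 0.0), 1.0)


def try_ignite(p: float, rng: random.Random) -> bool:
    """One ignition attempt from a burning cell toward an unburned neighbor."""
    return rng.random() < p
```

With `base_rate=0.5` and neutral factors, the probability is 0.5 on easy and 0.275 on hard, and full moisture or full suppression drives it to zero regardless of wind or slope.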
Results
Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of `training/grpo_v2_colab.ipynb`, evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
| Trained Qwen-2.5-7B (ours) | +5.13 ± 3.90 | +5.74 ± 3.07 | +2.14 ± 2.87 |
| Δ vs. Heuristic | −2.41 | −0.58 ✓ | −2.59 |
The Medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.
Auxiliary metrics for the trained agent:
| Metric | Easy | Medium | Hard |
|---|---|---|---|
| JSON success rate | 98.5% | 99.8% | 99.2% |
| Mean population saved % | 87% | 97% | 92% |
Curriculum progression: easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.
Full scores in `training/grpo_eval_results.json`. Training history in `training/training_stats.json`.
Training
We use a two-stage recipe:
- SFT warm-up: generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (r=32, MLP+attention adapters). Notebook: `training/sft_colab.ipynb`.
- GRPO (TRL `GRPOTrainer`): start from the SFT adapter, score completions by resetting the env to the exact `(tier, seed)` that produced each prompt, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: `training/grpo_v2_colab.ipynb`.
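For orientation, a JSON-validity reward in the shape TRL expects (a callable that receives `completions` plus keyword arguments and returns one float per completion) might look like the sketch below. This is not the notebook's `reward_fn_format`: the name `format_reward_sketch`, the 1.0/0.0 scoring, and the requirement that an `action_type` key be present are assumptions.

```python
import json


def format_reward_sketch(completions, **kwargs):
    """Return 1.0 for each completion that parses as a JSON action object.

    Accepts plain-string completions, or conversational completions given as
    a list of message dicts (the first message's content is scored).
    """
    rewards = []
    for completion in completions:
        text = completion if isinstance(completion, str) else completion[0]["content"]
        try:
            obj = json.loads(text)
            ok = isinstance(obj, dict) and "action_type" in obj
        except (json.JSONDecodeError, TypeError):
            ok = False
        rewards.append(1.0 if ok else 0.0)
    return rewards
```

Keeping the format signal separate from the outcome signal means a well-formed but strategically poor action still receives some reward, which is one way to keep JSON output stable while the policy explores.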
Hardware: A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
Training stack: unsloth 2026.4.8 (4-bit QLoRA), trl==0.20.0, datasets==3.4.1, transformers 5.5.0, peft, wandb.
Training plots: W&B run saini-eshit-/wildfire-grpo/runs/dnz56kuu (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: training/training_dashboard.png (not tracked in git β generate with python scripts/plot_grpo_training.py).
For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read BLOG.md.
Project Structure
Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py        # Main env: reset(), step(), state()
│   ├── models.py              # Pydantic action/observation/state models
│   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py         # Cellular automaton fire propagation
│   ├── weather.py             # Stochastic weather engine
│   ├── resources.py           # Crews, tankers, firebreaks, recon
│   ├── reward.py              # Decomposed step + terminal reward
│   ├── briefing.py            # OperationalBriefing generation
│   ├── serialization.py       # Observation → LLM prompt
│   ├── action_parser.py       # LLM output → Action (3-layer fallback)
│   ├── rendering.py           # Frame rendering for GIF replays
│   └── curriculum.py          # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py         # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py            # Baseline eval (random + heuristic)
│   ├── eval_compare.py        # Multi-agent comparison
│   ├── eval_trained_model.py  # Evaluate a trained adapter
│   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
│   ├── replay.py              # Render episode as GIF
│   ├── run_demo.py            # Pitch demo
│   └── plot_dashboard.py      # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
│   ├── sft_colab.ipynb        # SFT warm-up notebook
│   ├── sft_data.jsonl         # 4,300 SFT examples
│   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                 # FastAPI on port 7860
├── frontend/                  # Interactive HTML/JS frontend served at /ui/
├── tests/                     # 41 pytest tests
├── demos/                     # GIF/PNG demo assets
├── openenv.yaml               # OpenEnv environment manifest
├── Dockerfile                 # HF Space build
├── BLOG.md                    # Long-form write-up
└── README.md                  # You are here
Architecture Decisions
- Decomposed reward for GRPO. Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
- Operational briefings as first-class instructions. The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
- Two-stage training (SFT → GRPO). SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
- 3-layer action parser. JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
- Per-step (tier, seed) replay in the reward function. Each GRPO completion is scored by replaying the exact env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see `BLOG.md`, "What broke").
- Deterministic seeding. `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
- OpenEnv compliance over framework lock-in. The env is callable from Python (`env.reset/step/state`) and over HTTP (/reset, /step, /state). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
Citation
If you use this environment, please cite:
@misc{wildfire-containment-simulator-2026,
title = {Wildfire Containment Simulator: Long-Horizon Planning and
Instruction Following for Disaster-Response LLM Agents},
author = {Team Wildfire},
year = {2026},
url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
note = {Meta OpenEnv Hackathon submission, Theme 2}
}
License
MIT. Built on OpenEnv for the Meta × Hugging Face × Scaler hackathon, April 2026.