title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - reinforcement-learning
  - simulation
  - openenv
  - wildfire
  - rl-environment
  - long-horizon
  - instruction-following

Wildfire Containment Simulator

Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following


A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.

Headline result (post-training run, Apr 26): Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of +5.74 on the Medium tier, compared with +6.31 for the rule-based heuristic and +1.31 for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) within the first 63 of 150 training steps, while keeping the JSON success rate near 99% (98.5% to 99.8% per tier at evaluation). (Full comparison table in Results. Model: Eshit/wildfire-grpo-7b. W&B run: wildfire-grpo/runs/dnz56kuu.)


🔗 Quick Links

| Resource | Link |
|---|---|
| 🚀 Live HF Space (env) | huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator |
| 💻 GitHub source | github.com/Abrodolph/Wildfire-Containment-Simulator |
| 📒 GRPO training notebook | training/grpo_v2_colab.ipynb |
| 📒 SFT warm-up notebook | training/sft_colab.ipynb |
| 📝 Long-form blog post | BLOG.md |
| 📊 Baseline eval JSON | scripts/results.json |
| 📈 Training dashboard | W&B run: wildfire-grpo/runs/dnz56kuu |
| 🎬 Heuristic replay GIF | demos/heuristic_replay.gif |
| 🎥 2-minute pitch video | (YouTube link coming soon) |

Why Theme 2

| Pillar | How we model it |
|---|---|
| Long-horizon planning | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
| Instruction following | Every episode opens with an OperationalBriefing (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| Recovery from failure | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |

Real-World Motivation

Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so that an LLM can be trained, evaluated, and inspected on it end-to-end.

For the deeper story behind the design choices, see BLOG.md.


Quickstart

# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend

Full test suite: pytest tests -v (41 tests, ~30s on CPU).


Live Hugging Face Space

The environment is deployed at Eshit/Wildfire-Containment-Simulator on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:

SPACE=https://eshit-wildfire-containment-simulator.hf.space

curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'

Endpoints: /reset, /step, /state, /state/render, /auto_step, /health, /docs (Swagger UI), /ui/ (interactive frontend).
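The same calls can be made from Python. The following is a minimal sketch using only the standard library; the base URL, query parameters, and action payload come from the curl example above, while the helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical and the assumption that every response body is JSON is ours.

```python
import json
import urllib.parse
import urllib.request

SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base_url: str = SPACE) -> urllib.request.Request:
    """Build POST /reset?task_id=...&seed=... (mirrors the curl call)."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", data=b"", method="POST")


def build_step_request(action: dict, base_url: str = SPACE) -> urllib.request.Request:
    """Build POST /step with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a prepared request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical session would call `send(build_reset_request("easy", 42))` once, then loop over `send(build_step_request({...}))` with one action dictionary per step.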


Environment API

from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)

while not env.done:
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)               # StepResult
    obs = result.observation
    reward = result.reward                  # decomposed float, range ~−8 to +8
    done = result.done

state = env.state()                          # Full ground truth (grading only)

reset(task_id, seed) is fully deterministic. state() is intentionally exposed only for graders; agents must work from Observation.


Action Space

All actions are Pydantic-validated. Invalid actions return a penalty reward without crashing the environment.

| Action | Required parameters | Description |
|---|---|---|
| deploy_crew | crew_id, target_row, target_col | Place an undeployed crew on a safe cell |
| move_crew | crew_id, direction (N/S/E/W/NE/NW/SE/SW) | Move a deployed crew one cell |
| order_crew_objective | crew_id, objective (hold/advance/retreat/prioritize_*) | Set a persistent directive for a crew's local policy |
| drop_retardant | tanker_id, target_row, target_col | 3×3 retardant drop with 5-step cooldown |
| build_firebreak | crew_id, direction | Permanent non-flammable cell adjacent to a crew |
| recon_flight | target_row, target_col | Reveal a 10×10 area for 5 steps |
| idle | reason (optional) | Explicitly wait |

A 3-layer parser (env/action_parser.py) maps raw LLM output → structured Action: direct JSON → regex field extraction → safe-idle fallback. The environment loop never breaks on bad model output.
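The three layers can be sketched as follows. This is an illustration of the fallback idea only, not the code in env/action_parser.py: the function name `parse_action`, the regex patterns, and the use of plain dictionaries instead of the Pydantic Action model are all assumptions made to keep the example self-contained.

```python
import json
import re
from typing import Any, Dict

# Assumed fallback payload; the real parser may attach a different reason.
IDLE_FALLBACK = {"action_type": "idle", "reason": "unparseable model output"}

# Field names taken from the action table; the patterns are hypothetical.
_STRING_FIELDS = ("action_type", "crew_id", "tanker_id", "direction", "objective")
_INT_FIELDS = ("target_row", "target_col")


def parse_action(raw: str) -> Dict[str, Any]:
    """Map raw model text to an action dict without ever raising."""
    # Layer 1: the output is already valid JSON with an action_type.
    try:
        candidate = json.loads(raw)
        if isinstance(candidate, dict) and "action_type" in candidate:
            return candidate
    except json.JSONDecodeError:
        pass

    # Layer 2: pull individual fields out of chatty or truncated output.
    fields: Dict[str, Any] = {}
    for name in _STRING_FIELDS:
        match = re.search(rf'"?{name}"?\s*[:=]\s*"?([A-Za-z_0-9]+)"?', raw)
        if match:
            fields[name] = match.group(1)
    for name in _INT_FIELDS:
        match = re.search(rf'"?{name}"?\s*[:=]\s*(-?\d+)', raw)
        if match:
            fields[name] = int(match.group(1))
    if "action_type" in fields:
        return fields

    # Layer 3: nothing usable was found, so wait safely.
    return dict(IDLE_FALLBACK)
```

Returning a fresh copy of the idle payload keeps callers from mutating the shared constant. In the real environment the resulting dictionary would still pass through Pydantic validation, so a regex-recovered action with missing fields is penalized as invalid and does not crash the loop.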


Observation Space

| Component | Contents | Noise / occlusion |
|---|---|---|
| briefing | OperationalBriefing on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
| grid | 2D array of CellObservation (fire_state, intensity_bin, smoke_density, is_populated, crew_present) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| weather | wind_speed_kmh, wind_direction_deg, humidity_pct, rain_active | ±5 km/h, ±20° on medium/hard |
| resources | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| stats | cells_burned, cells_burning, population_lost, containment_pct, current_step | Fully observable |
| recent_events | Last 5 notable events | Fully observable |

The observation is rendered into LLM-friendly text via serialize_observation() (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is O(regions) instead of O(cells).
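To illustrate the clustering step, here is a minimal sketch of BFS over burning cells that emits one bounding box per connected region. It is not the code in env/serialization.py: the function name `cluster_fire_regions`, the boolean-grid input, the tuple output, and the choice of 8-connectivity (picked to match the Moore neighborhood of the fire model) are assumptions.

```python
from collections import deque
from typing import List, Sequence, Tuple

# (min_row, min_col, max_row, max_col, cell_count)
Region = Tuple[int, int, int, int, int]


def cluster_fire_regions(burning: Sequence[Sequence[bool]]) -> List[Region]:
    """Group burning cells into 8-connected regions and return bounding boxes."""
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions: List[Region] = []

    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a BFS from this unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (
                            (dr or dc)
                            and 0 <= nr < rows
                            and 0 <= nc < cols
                            and burning[nr][nc]
                            and not seen[nr][nc]
                        ):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            regions.append((min_r, min_c, max_r, max_c, count))
    return regions
```

Each region can then be rendered as a single prompt line (for example its bounding box and cell count), which is what keeps the prompt length proportional to the number of regions and not to the grid size.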


Reward Function

Decomposed for GRPO: a wide reward range produces meaningful advantages between the rollouts within each group.

Per-step (dense):

step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag

Terminal (sparse, on episode end):

+5.0   if all populations safe
+0–2.0 efficiency bonus (faster containment ⇒ more)
+1.0   briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0   if any crew casualty
−0.01 × invalid_action_count    capped at −0.2

Total empirical range: −8 to +8, declared in openenv.yaml.

| Tier | Spread scale | Episode length (steps) | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |
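The per-step and terminal terms in this section can be written out as a short sketch. The coefficients are the ones listed above; the function names and argument names are hypothetical, and the linear shape of the efficiency bonus is an assumption, since the README only says that faster containment earns more of the 0 to 2.0 range.

```python
def step_reward(delta_containment: float, delta_population_safety: float,
                redundant_action: bool) -> float:
    """Dense per-step term: 0.4*dC + 0.4*dP - 0.1*redundant flag."""
    return (0.4 * delta_containment
            + 0.4 * delta_population_safety
            - 0.1 * float(redundant_action))


def terminal_reward(pop_lost: int, total_pop: int, crew_casualty: bool,
                    priority_zones_survived: bool, steps_used: int,
                    max_steps: int, invalid_action_count: int) -> float:
    """Sparse end-of-episode term built from the listed components."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                                # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)       # proportional loss penalty
    # ASSUMPTION: bonus scales linearly with the fraction of unused steps.
    reward += 2.0 * max(0.0, 1.0 - steps_used / max_steps)
    if priority_zones_survived:
        reward += 1.0                                # briefing adherence
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # capped at -0.2
    return reward
```

Under these assumptions, a clean Easy episode that finishes in 40 of 80 steps earns 5.0 + 1.0 + 1.0 = 7.0 at termination, before the dense per-step terms are added.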

Three Difficulty Tiers

Task 1 (Easy): Flatland Grass Fire

15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. Focus: basic deployment and perimeter control.

Task 2 (Medium): Canyon Terrain with Wind Shifts

25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. Focus: terrain-aware containment under multi-front pressure.

Task 3 (Hard): Wildland-Urban Interface Crisis

40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. Focus: long-horizon planning under uncertainty and recovery from cascading failures.


Fire Spread Model

A Rothermel-inspired cellular automaton on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:

P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale

| Factor | Effect |
|---|---|
| base_rate | Baseline spread by fuel type |
| fuel_factor | Fuel load of the target cell |
| wind_factor | Boost when wind aligns with the spread vector, dampened otherwise |
| slope_factor | Faster uphill, slower downhill |
| moisture | Wet ground / recent rain reduces ignition probability |
| suppression | Crew presence and retardant coverage reduce spread |
| tier_scale | easy=1.00, medium=0.70, hard=0.55 |

Burning cells progress through BURNING → EMBER → BURNED_OUT. Urban cells have higher peak intensity but lower ignition probability.
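The ignition formula in this section translates directly into code. The sketch below multiplies the listed factors and clamps the result to a valid probability; the tier scales are the documented ones, while the function names and the cosine-based `wind_factor` (including its `gain` and floor values) are illustrative assumptions and not the model in env/fire_spread.py.

```python
import math

TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def wind_factor(wind_toward_deg: float, spread_deg: float,
                wind_speed_kmh: float, gain: float = 0.02) -> float:
    """ASSUMPTION: boost when wind points along the spread vector, damp otherwise."""
    alignment = math.cos(math.radians(wind_toward_deg - spread_deg))
    return max(0.1, 1.0 + gain * wind_speed_kmh * alignment)


def ignition_probability(base_rate: float, fuel_factor: float, wind: float,
                         slope_factor: float, moisture: float,
                         suppression: float, tier: str) -> float:
    """Product of the documented factors, clamped to [0, 1]."""
    p = (base_rate * fuel_factor * wind * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(1.0, max(0.0, p))
```

Because the factors multiply, full suppression (suppression = 1.0) or saturated moisture drives the probability to zero regardless of wind, which is why crew presence and retardant drops can hold a line even in strong wind.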


Results

Baselines reproduced via python scripts/evaluate.py 5 on seeds 42–46. Trained-model numbers from Section 10 of training/grpo_v2_colab.ipynb, evaluated on seeds 42–56 (15 per tier; training drew from seeds 0–99).

| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
| Trained Qwen-2.5-7B (ours) | +5.13 ± 3.90 | +5.74 ± 3.07 | +2.14 ± 2.87 |
| Δ vs. Heuristic | −2.41 | −0.58 ✓ | −2.59 |

The medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.

Auxiliary metrics for the trained agent:

Metric Easy Medium Hard
JSON success rate 98.5% 99.8% 99.2%
Mean population saved % 87% 97% 92%

Curriculum progression: easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.

Full scores in training/grpo_eval_results.json. Training history in training/training_stats.json.


Training

We use a two-stage recipe:

  1. SFT warm-up: generate 4,300 (prompt, action_json) pairs from the heuristic on successful episodes (filtered to pop_lost == 0), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (r=32, MLP+attention adapters). Notebook: training/sft_colab.ipynb.
  2. GRPO (TRL GRPOTrainer): start from the SFT adapter, score completions by resetting the env to the exact (tier, seed) that produced each prompt, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: reward_fn_outcome (full episode reward) and reward_fn_format (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: training/grpo_v2_colab.ipynb.

Hardware: A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time. Training stack: unsloth 2026.4.8 (4-bit QLoRA), trl==0.20.0, datasets==3.4.1, transformers 5.5.0, peft, wandb.

Training plots: W&B run saini-eshit-/wildfire-grpo/runs/dnz56kuu (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: training/training_dashboard.png (not tracked in git; generate with python scripts/plot_grpo_training.py).

For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read BLOG.md.


Project Structure

Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py       # Main env: reset(), step(), state()
│   ├── models.py             # Pydantic action/observation/state models
│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py        # Cellular automaton fire propagation
│   ├── weather.py            # Stochastic weather engine
│   ├── resources.py          # Crews, tankers, firebreaks, recon
│   ├── reward.py             # Decomposed step + terminal reward
│   ├── briefing.py           # OperationalBriefing generation
│   ├── serialization.py      # Observation → LLM prompt
│   ├── action_parser.py      # LLM output → Action (3-layer fallback)
│   ├── rendering.py          # Frame rendering for GIF replays
│   └── curriculum.py         # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py        # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py           # Baseline eval (random + heuristic)
│   ├── eval_compare.py       # Multi-agent comparison
│   ├── eval_trained_model.py # Evaluate a trained adapter
│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
│   ├── replay.py             # Render episode as GIF
│   ├── run_demo.py           # Pitch demo
│   └── plot_dashboard.py     # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
│   ├── sft_colab.ipynb       # SFT warm-up notebook
│   ├── sft_data.jsonl        # 4,300 SFT examples
│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                # FastAPI on port 7860
├── frontend/                 # Interactive HTML/JS frontend served at /ui/
├── tests/                    # 41 pytest tests
├── demos/                    # GIF/PNG demo assets
├── openenv.yaml              # OpenEnv environment manifest
├── Dockerfile                # HF Space build
├── BLOG.md                   # Long-form write-up
└── README.md                 # You are here

Architecture Decisions

  1. Decomposed reward for GRPO. Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between the rollouts within each GRPO group.
  2. Operational briefings as first-class instructions. The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
  3. Two-stage training (SFT → GRPO). SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
  4. 3-layer action parser. JSON parse → regex fallback → safe-idle. The training loop never breaks on malformed model output.
  5. Per-step (tier, seed) replay in the reward function. Each GRPO completion is scored by replaying the exact env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see BLOG.md → "What broke").
  6. Deterministic seeding. np.random.default_rng(seed) is threaded through every subsystem, so every run is byte-for-byte reproducible.
  7. OpenEnv compliance over framework lock-in. The env is callable from Python (env.reset/step/state) and over HTTP (/reset, /step, /state). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
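The deterministic-seeding decision above can be illustrated with a small sketch in which one `np.random.default_rng(seed)` generator is created at reset time and handed to every stochastic subsystem. The `WeatherEngine` and `FireSpread` classes and the `rollout` helper here are hypothetical stand-ins and do not reflect the real interfaces in env/weather.py or env/fire_spread.py.

```python
import numpy as np


class WeatherEngine:
    """Hypothetical stand-in: draws wind speed from the shared generator."""

    def __init__(self, rng: np.random.Generator) -> None:
        self.rng = rng

    def sample_wind_speed(self) -> float:
        return float(self.rng.uniform(0.0, 40.0))


class FireSpread:
    """Hypothetical stand-in: decides ignition from the shared generator."""

    def __init__(self, rng: np.random.Generator) -> None:
        self.rng = rng

    def ignites(self, probability: float) -> bool:
        return bool(self.rng.random() < probability)


def rollout(seed: int, steps: int = 10) -> list:
    """Run a toy episode; all randomness flows from one seeded generator."""
    rng = np.random.default_rng(seed)
    weather = WeatherEngine(rng)
    fire = FireSpread(rng)
    return [(weather.sample_wind_speed(), fire.ignites(0.3)) for _ in range(steps)]
```

Because no subsystem creates its own generator or touches global random state, calling `rollout` twice with the same seed yields the same trace, which is the property that lets a GRPO reward function replay the exact (tier, seed) state behind a prompt.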

Citation

If you use this environment, please cite:

@misc{wildfire-containment-simulator-2026,
  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
            Instruction Following for Disaster-Response LLM Agents},
  author = {Team Wildfire},
  year   = {2026},
  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
  note   = {Meta OpenEnv Hackathon submission, Theme 2}
}

License

MIT. Built on OpenEnv for the Meta × Hugging Face × Scaler hackathon, April 2026.