title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - reinforcement-learning
  - simulation
  - openenv
  - wildfire
  - rl-environment
  - long-horizon
  - instruction-following

Wildfire Containment Simulator

Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following

A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.

Headline result (post-training run, Apr 26): Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of +5.74 on the Medium tier, compared with +6.31 for the rule-based heuristic and +1.31 for the random baseline. During training, the curriculum auto-promoted the model through all three tiers (easy → medium → hard) within the first 63 of 150 training steps, while the JSON success rate stayed at 99%+ throughout. (Full comparison table in Results. Model: Eshit/wildfire-grpo-7b. W&B run: wildfire-grpo/runs/dnz56kuu.)

🔗 Quick Links

Resource	Link
🚀 Live HF Space (env)	huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator
💻 GitHub source	github.com/Abrodolph/Wildfire-Containment-Simulator
📒 GRPO training notebook	`training/grpo_v2_colab.ipynb`
📒 SFT warm-up notebook	`training/sft_colab.ipynb`
📝 Long-form blog post	`BLOG.md`
📊 Baseline eval JSON	`scripts/results.json`
📈 Training dashboard	W&B run: wildfire-grpo/runs/dnz56kuu
🎬 Heuristic replay GIF	`demos/heuristic_replay.gif`
🎥 2-minute pitch video	(YouTube link coming soon)

Why Theme 2

Pillar	How we model it
Long-horizon planning	Hard tier runs 300 steps with the +5.0 terminal reward only triggered on full population survival — greedy local moves cannot capture it.
Instruction following	Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones.
Recovery from failure	Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population.

Real-World Motivation

Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.

For the deeper story behind the design choices, see BLOG.md.

Quickstart

# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend

Full test suite: pytest tests -v (41 tests, ~30s on CPU).

Live Hugging Face Space

The environment is deployed at Eshit/Wildfire-Containment-Simulator on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:

SPACE=https://eshit-wildfire-containment-simulator.hf.space

curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'

Endpoints: /reset, /step, /state, /state/render, /auto_step, /health, /docs (Swagger UI), /ui/ (interactive frontend).
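
The same calls can be made from Python. The sketch below mirrors the curl examples using only the standard library. The helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical, and it assumes the server replies with JSON; the response schema is not specified here, so the decoded body is returned as-is.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base: str = SPACE) -> urllib.request.Request:
    """Build POST /reset with task_id and seed as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base}/reset?{query}", data=b"", method="POST")


def build_step_request(action: dict, base: str = SPACE) -> urllib.request.Request:
    """Build POST /step with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the body (ASSUMPTION: the reply is JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The two builders have no side effects, so they can be inspected or logged before anything is sent. Only `send(build_reset_request("easy", 42))` touches the network.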

Environment API

from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)

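while not env.done:
    # Fixed action for illustration only; a real agent would choose one from `obs`.
    # Re-deploying an already-deployed crew is invalid and only incurs a penalty.
    action = Action(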
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)               # StepResult
    obs = result.observation
    reward = result.reward                  # decomposed float, range ~−8 to +8
    done = result.done

state = env.state()                          # Full ground truth (grading only)

reset(task_id, seed) is fully deterministic. state() is intentionally exposed only for graders — agents must work from Observation.

Action Space

All actions are Pydantic-validated. Invalid actions return a penalty reward without crashing the environment.

Action	Required parameters	Description
`deploy_crew`	`crew_id`, `target_row`, `target_col`	Place an undeployed crew on a safe cell
`move_crew`	`crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`)	Move a deployed crew one cell
`order_crew_objective`	`crew_id`, `objective` (`hold/advance/retreat/prioritize_*`)	Set a persistent directive for a crew's local policy
`drop_retardant`	`tanker_id`, `target_row`, `target_col`	3×3 retardant drop with 5-step cooldown
`build_firebreak`	`crew_id`, `direction`	Permanent non-flammable cell adjacent to a crew
`recon_flight`	`target_row`, `target_col`	Reveal a 10×10 area for 5 steps
`idle`	`reason` (optional)	Explicitly wait

A 3-layer parser (env/action_parser.py) maps raw LLM output → structured Action: direct JSON → regex field extraction → safe-idle fallback. The environment loop never breaks on bad model output.
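
The three layers can be sketched roughly as follows. This is not the code in env/action_parser.py: the function names, the regular expressions, and the plain-dict return value are assumptions for illustration (the real parser returns a Pydantic `Action`). Only the action names come from the table above.

```python
import json
import re

# Action names taken from the Action Space table; everything else is illustrative.
VALID_ACTIONS = {
    "deploy_crew", "move_crew", "order_crew_objective", "drop_retardant",
    "build_firebreak", "recon_flight", "idle",
}
SAFE_IDLE = {"action_type": "idle", "reason": "parse_failure"}


def _try_json(text):
    """Layer 1: parse the span from the first '{' to the last '}' as JSON."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and data.get("action_type") in VALID_ACTIONS:
        return data
    return None


def _try_regex(text):
    """Layer 2: extract individual fields from malformed output."""
    match = re.search(r'action_type["\']?\s*[:=]\s*["\']?(\w+)', text)
    if not match or match.group(1) not in VALID_ACTIONS:
        return None
    action = {"action_type": match.group(1)}
    for key in ("crew_id", "tanker_id", "direction", "objective"):
        found = re.search(rf'{key}["\']?\s*[:=]\s*["\']?(\w+)', text)
        if found:
            action[key] = found.group(1)
    for key in ("target_row", "target_col"):
        found = re.search(rf'{key}["\']?\s*[:=]\s*(-?\d+)', text)
        if found:
            action[key] = int(found.group(1))
    return action


def parse_action(text):
    """Return the result of the first layer that succeeds, else a safe idle."""
    return _try_json(text) or _try_regex(text) or dict(SAFE_IDLE)
```

Because the last layer always returns an `idle` action, the caller never has to handle a parse exception. A bad completion costs the agent a wasted step rather than ending the episode.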

Observation Space

Component	Contents	Noise / occlusion
`briefing`	`OperationalBriefing` on first obs — incident ID, priority zones, infrastructure, wind forecast	First step only
`grid`	2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`)	Smoke occlusion (medium/hard); fog-of-war (hard)
`weather`	`wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active`	±5 km/h, ±20° on medium/hard
`resources`	Crew positions, tanker cooldowns, firebreak budget, recon budget	Fully observable
`stats`	`cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step`	Fully observable
`recent_events`	Last 5 notable events	Fully observable

The observation is rendered into LLM-friendly text via serialize_observation() (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is O(regions) instead of O(cells).
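
As an illustration of the clustering idea, here is a minimal sketch that groups burning cells with breadth-first search and emits one bounding box per group. It is not the code in env/serialization.py: the function name, the boolean-grid input, and the use of 8-connectivity are assumptions.

```python
from collections import deque


def cluster_fire_regions(burning):
    """Group 8-connected burning cells and return one bounding box per group.

    `burning` is a 2D list of booleans. Each box is
    (min_row, min_col, max_row, max_col, cell_count).
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Breadth-first search from this unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c, count))
    return boxes
```

Each box can then be rendered as a single prompt line, so a 40×40 grid with two fire fronts costs two lines rather than 1,600 cell descriptions.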

Reward Function

Decomposed for GRPO: a wide reward range produces meaningful advantages within each rollout group.

Per-step (dense):

step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag

Terminal (sparse, on episode end):

+5.0   if all populations safe
+0–2.0 efficiency bonus (faster containment ⇒ more)
+1.0   briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0   if any crew casualty
−0.01 · invalid_action_count    (capped at −0.2)

Total empirical range: −8 to +8, declared in openenv.yaml.
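
The terms above can be combined as in the following sketch. It is not the code in env/reward.py. The `EpisodeSummary` fields and function names are hypothetical, and the efficiency bonus is assumed to scale linearly with the fraction of unused steps, since the exact formula is not given here.

```python
from dataclasses import dataclass


@dataclass
class EpisodeSummary:
    """Hypothetical end-of-episode record; field names are illustrative."""
    pop_lost: int
    total_pop: int
    priority_zones_survived: bool
    crew_casualty: bool
    invalid_action_count: int
    steps_used: int
    max_steps: int


def step_reward(d_containment: float, d_pop_safety: float, redundant: bool) -> float:
    """Dense per-step term, using the coefficients listed above."""
    return 0.4 * d_containment + 0.4 * d_pop_safety - (0.1 if redundant else 0.0)


def terminal_reward(ep: EpisodeSummary) -> float:
    """Sparse terminal term, applied once when the episode ends."""
    reward = 0.0
    if ep.pop_lost == 0:
        reward += 5.0                               # all populations safe
    else:
        reward -= 3.0 * ep.pop_lost / ep.total_pop  # proportional loss penalty
    # ASSUMPTION: linear efficiency bonus in [0, 2] based on unused steps.
    reward += 2.0 * max(0.0, 1.0 - ep.steps_used / ep.max_steps)
    if ep.priority_zones_survived:
        reward += 1.0                               # briefing adherence
    if ep.crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * ep.invalid_action_count, 0.2)  # capped penalty
    return reward
```

Under these assumptions, a clean Easy episode that finishes in half the step budget scores 5.0 + 1.0 + 1.0 = 7.0 at the terminal step, before the dense per-step terms are added.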

Tier	Spread scale	Episode length	Approx. reward ceiling
Easy	1.00×	80	+8
Medium	0.70×	150	+7
Hard	0.55×	300	+6

Three Difficulty Tiers

Task 1 — Easy: Flatland Grass Fire

15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. Focus: basic deployment and perimeter control.

Task 2 — Medium: Canyon Terrain with Wind Shifts

25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. Focus: terrain-aware containment under multi-front pressure.

Task 3 — Hard: Wildland-Urban Interface Crisis

40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. Focus: long-horizon planning under uncertainty and recovery from cascading failures.

Fire Spread Model

A Rothermel-inspired cellular automaton on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:

P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale

Factor	Effect
`base_rate`	Baseline spread by fuel type
`fuel_factor`	Fuel load of the target cell
`wind_factor`	Boost when wind aligns with the spread vector, dampened otherwise
`slope_factor`	Faster uphill, slower downhill
`moisture`	Wet ground / recent rain reduces ignition probability
`suppression`	Crew presence and retardant coverage reduce spread
`tier_scale`	`easy=1.00`, `medium=0.70`, `hard=0.55`

Burning cells progress through BURNING → EMBER → BURNED_OUT. Urban cells have higher peak intensity but lower ignition probability.
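
The ignition formula can be written out as a short function. This is a sketch, not the code in env/fire_spread.py. The cosine-based `wind_factor`, its `strength` parameter, the convention that `wind_dir_deg` is the direction the wind blows toward, and the clamp to [0, 1] are all assumptions. Only the product form and the tier scales come from the text above.

```python
import math

# Tier scales as listed in the factor table.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def wind_factor(wind_dir_deg, spread_dir_deg, wind_speed_kmh, strength=0.02):
    """HYPOTHETICAL form: above 1 when wind blows along the spread vector, below 1 against it."""
    alignment = math.cos(math.radians(wind_dir_deg - spread_dir_deg))
    return max(0.0, 1.0 + strength * wind_speed_kmh * alignment)


def ignition_probability(base_rate, fuel_factor, wind, slope_factor,
                         moisture, suppression, tier):
    """Product of the factors above, clamped to a valid probability (clamp is an assumption)."""
    p = (base_rate * fuel_factor * wind * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(1.0, max(0.0, p))
```

For example, with `base_rate=0.3`, a 20 km/h tailwind (factor 1.4 under this form), flat ground, 50% moisture, no suppression, and the medium tier, the probability is 0.3 × 1.4 × 0.5 × 0.7 = 0.147 per neighbor per tick.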

Results

Baselines reproduced via python scripts/evaluate.py 5 on seeds 42–46. Trained-model numbers from Section 10 of training/grpo_v2_colab.ipynb, evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).

Agent	Easy (mean ± std)	Medium (mean ± std)	Hard (mean ± std)
Random	+6.23 ± 3.09	+1.31 ± 3.24	+2.16 ± 2.96
Heuristic	+7.53 ± 0.08	+6.31 ± 2.77	+4.74 ± 3.79
Trained Qwen-2.5-7B (ours)	+5.13 ± 3.90	+5.74 ± 3.07	+2.14 ± 2.87
Δ vs. Heuristic	−2.41	−0.58 ✓	−2.59

The medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.

Auxiliary metrics for the trained agent:

Metric	Easy	Medium	Hard
JSON success rate	98.5%	99.8%	99.2%
Mean population saved %	87%	97%	92%

Curriculum progression: easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.

Full scores in training/grpo_eval_results.json. Training history in training/training_stats.json.

Training

We use a two-stage recipe:

1. SFT warm-up: generate 4,300 (prompt, action_json) pairs from the heuristic on successful episodes (filtered to pop_lost == 0), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (r=32, MLP+attention adapters). Notebook: training/sft_colab.ipynb.
2. GRPO (TRL GRPOTrainer): start from the SFT adapter, score completions by resetting the env to the exact (tier, seed) that produced each prompt, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: reward_fn_outcome (full episode reward) and reward_fn_format (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: training/grpo_v2_colab.ipynb.
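
The replay-based scoring in the GRPO stage can be illustrated with a toy stand-in. Everything in this sketch is hypothetical: `ToyEnv`, `heuristic`, and `score_completion` are not the project's classes, and the toy dynamics bear no relation to the fire model. It only shows the pattern of resetting to the prompt's (tier, seed), applying the candidate action, and letting a heuristic finish the episode.

```python
import random


class ToyEnv:
    """HYPOTHETICAL stand-in for WildfireEnv: fully determined by (tier, seed)."""

    HORIZON = {"easy": 5, "medium": 8, "hard": 12}

    def reset(self, tier, seed):
        self.rng = random.Random(f"{tier}:{seed}")
        self.steps_left = self.HORIZON[tier]
        self.total = 0.0
        self.target = self.rng.randint(0, 3)  # the "right" action for this step
        return self.target

    def step(self, action):
        self.total += 1.0 if action == self.target else -0.5
        self.steps_left -= 1
        self.target = self.rng.randint(0, 3)
        return self.target, self.steps_left == 0


def heuristic(observation):
    """Toy heuristic: the observation directly names the right action."""
    return observation


def score_completion(tier, seed, candidate_action):
    """Score one completion against the exact state that produced its prompt."""
    env = ToyEnv()
    env.reset(tier, seed)                   # same (tier, seed) as the prompt
    obs, done = env.step(candidate_action)  # apply the model's action once
    while not done:                         # heuristic plays out the rest
        obs, done = env.step(heuristic(obs))
    return env.total
```

Because the rollout after the first step is identical for every candidate, any difference in score between two completions for the same prompt comes from the candidate action alone.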

Hardware: A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time. Training stack: unsloth 2026.4.8 (4-bit QLoRA), trl==0.20.0, datasets==3.4.1, transformers 5.5.0, peft, wandb.

Training plots: W&B run saini-eshit-/wildfire-grpo/runs/dnz56kuu (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: training/training_dashboard.png (not tracked in git — generate with python scripts/plot_grpo_training.py).

For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read BLOG.md.

Project Structure

Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py       # Main env: reset(), step(), state()
│   ├── models.py             # Pydantic action/observation/state models
│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py        # Cellular automaton fire propagation
│   ├── weather.py            # Stochastic weather engine
│   ├── resources.py          # Crews, tankers, firebreaks, recon
│   ├── reward.py             # Decomposed step + terminal reward
│   ├── briefing.py           # OperationalBriefing generation
│   ├── serialization.py      # Observation → LLM prompt
│   ├── action_parser.py      # LLM output → Action (3-layer fallback)
│   ├── rendering.py          # Frame rendering for GIF replays
│   └── curriculum.py         # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py        # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py           # Baseline eval (random + heuristic)
│   ├── eval_compare.py       # Multi-agent comparison
│   ├── eval_trained_model.py # Evaluate a trained adapter
│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
│   ├── replay.py             # Render episode as GIF
│   ├── run_demo.py           # Pitch demo
│   └── plot_dashboard.py     # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
│   ├── sft_colab.ipynb       # SFT warm-up notebook
│   ├── sft_data.jsonl        # 4,300 SFT examples
│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                # FastAPI on port 7860
├── frontend/                 # Interactive HTML/JS frontend served at /ui/
├── tests/                    # 41 pytest tests
├── demos/                    # GIF/PNG demo assets
├── openenv.yaml              # OpenEnv environment manifest
├── Dockerfile                # HF Space build
├── BLOG.md                   # Long-form write-up
└── README.md                 # You are here

Architecture Decisions

Decomposed reward for GRPO. Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages within each GRPO rollout group.
Operational briefings as first-class instructions. The briefing isn't cosmetic — protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
Two-stage training (SFT → GRPO). SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
3-layer action parser. JSON parse → regex fallback → safe-idle. The training loop never breaks on malformed model output.
Per-step (tier, seed) replay in the reward function. Each GRPO completion is scored by replaying the exact env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see BLOG.md → "What broke").
Deterministic seeding. np.random.default_rng(seed) is threaded through every subsystem — every run is byte-for-byte reproducible.
OpenEnv compliance over framework lock-in. The env is callable from Python (env.reset/step/state) and over HTTP (/reset, /step, /state). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.
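
To make the first decision concrete, here is a minimal sketch of the usual GRPO normalization: each completion's reward is compared with the mean of its own rollout group and scaled by the group's spread. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; this is not TRL's internal code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal. That is why a reward with real spread between good and bad actions matters more than its absolute scale.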

Citation

If you use this environment, please cite:

@misc{wildfire-containment-simulator-2026,
  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
            Instruction Following for Disaster-Response LLM Agents},
  author = {Team Wildfire},
  year   = {2026},
  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
  note   = {Meta OpenEnv Hackathon submission, Theme 2}
}

License

MIT. Built on OpenEnv for the Meta × Hugging Face × Scaler hackathon, April 2026.