
	---
	title: Wildfire Containment Simulator
	emoji: 🔥
	colorFrom: red
	colorTo: purple
	sdk: docker
	pinned: false
	license: mit
	tags:
	- reinforcement-learning
	- simulation
	- openenv
	- wildfire
	- rl-environment
	- long-horizon
	- instruction-following
	---

	# Wildfire Containment Simulator

	Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following

	![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
	![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
	![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
	![Python](https://img.shields.io/badge/Python-3.11+-blue)
	![License](https://img.shields.io/badge/License-MIT-green)

	A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.

	> Headline result (post-training run, Apr 26): Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of +5.74 on Medium tier — vs. +6.31 for the rule-based heuristic and +1.31 for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining 99%+ JSON success rate throughout.
	> (Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)

	---

	## 🔗 Quick Links

	| Resource | Link |
	|---|---|
	| 🚀 Live HF Space (env) | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
	| 💻 GitHub source | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
	| 📒 GRPO training notebook | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
	| 📒 SFT warm-up notebook | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
	| 📝 Long-form blog post | [`BLOG.md`](BLOG.md) |
	| 📊 Baseline eval JSON | [`scripts/results.json`](scripts/results.json) |
	| 📈 Training dashboard | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
	| 🎬 Heuristic replay GIF | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
	| 🎥 2-minute pitch video | (YouTube link coming soon) |

	---

	## Why Theme 2

	| Pillar | How we model it |
	|---|---|
	| Long-horizon planning | Hard tier runs 300 steps, and the +5.0 terminal reward is only triggered on full population survival, so greedy local moves cannot capture it. |
	| Instruction following | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
	| Recovery from failure | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |

	---

	## Real-World Motivation

	Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.

	For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).

	---

	## Quickstart

	```bash
	# Clone and install
	git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
	cd Wildfire-Containment-Simulator
	uv pip install -r requirements.txt
	uv pip install -e .

	# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
	python scripts/evaluate.py 5

	# Compare agents head-to-head
	python scripts/eval_compare.py --seeds 42 43 44 45 46 \
	    --tiers easy medium hard --agents random heuristic

	# Render an episode as a GIF
	python scripts/replay.py --tier medium --seed 42 \
	    --agent heuristic --output demos/replay.gif

	# Spin up the OpenEnv FastAPI server locally on port 7860
	python server/app.py
	# Then visit http://localhost:7860/ui/ for the interactive frontend
	```

	Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).

	---

	## Live Hugging Face Space

	The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:

	```bash
	SPACE=https://eshit-wildfire-containment-simulator.hf.space

	curl "$SPACE/health"
	curl -X POST "$SPACE/reset?task_id=easy&seed=42"
	curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
	    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
	```

	Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
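
The same calls can be made from Python with only the standard library. This is a minimal sketch, assuming the endpoints accept the query parameters and JSON body shown in the curl examples above; the helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical, and the response is returned as plain parsed JSON because its schema is not described here.

```python
import json
import urllib.parse
import urllib.request

SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base: str = SPACE) -> urllib.request.Request:
    """Build the POST /reset request with task_id and seed as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base}/reset?{query}", method="POST")


def build_step_request(action: dict, base: str = SPACE) -> urllib.request.Request:
    """Build the POST /step request with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response (schema not assumed)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical session would call `send(build_reset_request("easy", 42))` once, then `send(build_step_request({...}))` in a loop. Building and sending are kept separate so the requests can be inspected or logged before any network traffic happens.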

	---

	## Environment API

	```python
	from env import WildfireEnv, Action, ActionType, Direction

	env = WildfireEnv()
	obs = env.reset(task_id="easy", seed=42)  # Observation (with OperationalBriefing on first step)

	while not env.done:
	    action = Action(
	        action_type=ActionType.DEPLOY_CREW,
	        crew_id="crew_0",
	        target_row=7, target_col=7,
	    )
	    result = env.step(action)  # StepResult
	    obs = result.observation
	    reward = result.reward  # decomposed float, range ~−8 to +8
	    done = result.done

	state = env.state()  # Full ground truth (grading only)
	```

	`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders — agents must work from `Observation`.
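
For evaluation it is often convenient to wrap this loop in a helper that returns the episode's total reward. The sketch below relies only on the attributes used in the snippet above (`env.done`, `env.step()`, `result.observation`, `result.reward`); `run_episode`, the `policy` callable, and the `max_steps` safety cap are hypothetical additions, not part of the package.

```python
from typing import Any, Callable


def run_episode(env: Any, obs: Any, policy: Callable[[Any], Any], max_steps: int = 1000) -> float:
    """Drive an already-reset env with `policy` and return the summed reward.

    `policy` maps the latest observation to an action. `max_steps` is a
    safety cap (an assumption) so a buggy env cannot loop forever.
    """
    total = 0.0
    steps = 0
    while not env.done and steps < max_steps:
        result = env.step(policy(obs))
        obs = result.observation  # agents only ever see Observation, never state()
        total += result.reward
        steps += 1
    return total
```

With the real environment this would be called as `run_episode(env, env.reset(task_id="easy", seed=42), my_policy)`, where `my_policy` is any function that returns an `Action`.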

	---

	## Action Space

	All actions are Pydantic-validated. Invalid actions return a penalty reward without crashing the environment.

	| Action | Required parameters | Description |
	|---|---|---|
	| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
	| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
	| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
	| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
	| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
	| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
	| `idle` | `reason` (optional) | Explicitly wait |

	A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. The environment loop never breaks on bad model output.
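
The three layers can be sketched as follows. This is an illustration, not the code in `env/action_parser.py`: it returns a plain dict instead of the Pydantic `Action`, the regexes are assumptions, and the field names are taken from the action table above.

```python
import json
import re

# Names from the action table above; everything else here is illustrative.
ACTION_TYPES = (
    "deploy_crew", "move_crew", "order_crew_objective",
    "drop_retardant", "build_firebreak", "recon_flight", "idle",
)
STRING_FIELDS = ("crew_id", "tanker_id", "direction", "objective")
INT_FIELDS = ("target_row", "target_col")


def parse_action(raw: str) -> dict:
    """Map raw model output to an action dict, never raising on bad input."""
    # Layer 1: the whole output is valid JSON with a known action_type.
    try:
        data = json.loads(raw)
        if isinstance(data, dict) and data.get("action_type") in ACTION_TYPES:
            return data
    except ValueError:
        pass

    # Layer 2: pull individual fields out of surrounding chatter with regexes.
    match = re.search(r'"?action_type"?\s*[:=]\s*"?([a-z_]+)"?', raw)
    if match and match.group(1) in ACTION_TYPES:
        action = {"action_type": match.group(1)}
        for name in STRING_FIELDS:
            found = re.search(rf'"?{name}"?\s*[:=]\s*"?([A-Za-z0-9_]+)"?', raw)
            if found:
                action[name] = found.group(1)
        for name in INT_FIELDS:
            found = re.search(rf'"?{name}"?\s*[:=]\s*"?(-?\d+)"?', raw)
            if found:
                action[name] = int(found.group(1))
        return action

    # Layer 3: nothing usable, so fall back to a safe idle action.
    return {"action_type": "idle", "reason": "unparseable model output"}
```

Ordering the layers from strictest to most permissive means well-formed output is never reinterpreted, while malformed output degrades to `idle` instead of raising.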

	---

	## Observation Space

	| Component | Contents | Noise / occlusion |
	|---|---|---|
	| `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
	| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
	| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
	| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
	| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
	| `recent_events` | Last 5 notable events | Fully observable |

	The observation is rendered into LLM-friendly text via `serialize_observation()` (`env/serialization.py`), which BFS-clusters fire regions into bounding boxes so the prompt size is `O(regions)` instead of `O(cells)`.
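
To make the clustering idea concrete, here is a minimal sketch of BFS over a boolean "is burning" grid that emits one bounding box per connected fire region. The function name, the input format, and the choice of 8-connectivity (matching the Moore neighborhood used by the fire model) are assumptions; the real `serialize_observation()` may differ.

```python
from collections import deque


def cluster_fire_regions(burning: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Return one (min_row, min_col, max_row, max_col) box per connected fire region."""
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a BFS from this unvisited burning cell.
            seen[r][c] = True
            queue = deque([(r, c)])
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                # Visit all 8 neighbours (assumed connectivity).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

Each box can then be rendered as a single prompt line, so a fire covering hundreds of cells costs the same number of tokens as one covering a handful.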

	---

	## Reward Function

	Decomposed for GRPO: a wide reward range produces meaningful advantages among the rollouts in each group.

	Per-step (dense):
	```
	step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
	```

	Terminal (sparse, on episode end):
	```
	+5.0 if all populations safe
	+0–2.0 efficiency bonus (faster containment ⇒ more)
	+1.0 briefing-adherence bonus (all priority zones survived)
	−3.0 · (pop_lost / total_pop) if any population lost
	−2.0 if any crew casualty
	−0.01 × invalid_action_count capped at −0.2
	```

	Total empirical range: −8 to +8, declared in `openenv.yaml`.
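
The two formulas above translate into code roughly as follows. This is a sketch rather than the contents of `env/reward.py`: the function and argument names are hypothetical, the efficiency bonus is passed in because its formula is not given here, and it is assumed to apply whether or not population was lost.

```python
def step_reward(d_containment: float, d_population_safety: float, redundant: bool) -> float:
    """Dense per-step reward from the formula above."""
    return 0.4 * d_containment + 0.4 * d_population_safety - 0.1 * (1.0 if redundant else 0.0)


def terminal_reward(
    pop_lost: int,
    total_pop: int,
    efficiency_bonus: float,
    priority_zones_survived: bool,
    crew_casualty: bool,
    invalid_action_count: int,
) -> float:
    """Sparse end-of-episode reward from the terms listed above."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                                  # all populations safe
    else:
        reward -= 3.0 * pop_lost / total_pop           # scaled population loss
    reward += min(max(efficiency_bonus, 0.0), 2.0)     # bonus limited to 0..2.0
    if priority_zones_survived:
        reward += 1.0                                  # briefing adherence
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)    # invalid-action penalty, capped
    return reward
```

Under these assumptions the best terminal outcome is 5.0 + 2.0 + 1.0 = 8.0, which matches the stated upper end of the range.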

	| Tier | Spread scale | Episode length | Approx. reward ceiling |
	|---|---|---|---|
	| Easy | 1.00× | 80 | +8 |
	| Medium | 0.70× | 150 | +7 |
	| Hard | 0.55× | 300 | +6 |

	---

	## Three Difficulty Tiers

	### Task 1 — Easy: Flatland Grass Fire
	15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. Focus: basic deployment and perimeter control.

	### Task 2 — Medium: Canyon Terrain with Wind Shifts
	25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. Focus: terrain-aware containment under multi-front pressure.

	### Task 3 — Hard: Wildland-Urban Interface Crisis
	40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. Focus: long-horizon planning under uncertainty and recovery from cascading failures.

	---

	## Fire Spread Model

	A Rothermel-inspired cellular automaton on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:

	```
	P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
	× (1 − moisture) × (1 − suppression) × tier_scale
	```

	| Factor | Effect |
	|---|---|
	| `base_rate` | Baseline spread by fuel type |
	| `fuel_factor` | Fuel load of the target cell |
	| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
	| `slope_factor` | Faster uphill, slower downhill |
	| `moisture` | Wet ground / recent rain reduces ignition probability |
	| `suppression` | Crew presence and retardant coverage reduce spread |
	| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |

	Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
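
As a worked example of the ignition formula, the sketch below multiplies the factors and clamps the result to a valid probability. The function name and the clamping step are assumptions (the formula above does not say how values over 1 are handled), and the sample inputs are illustrative rather than taken from `env/fire_spread.py`.

```python
# Tier scales from the factor table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(
    base_rate: float,
    fuel_factor: float,
    wind_factor: float,
    slope_factor: float,
    moisture: float,
    suppression: float,
    tier_scale: float,
) -> float:
    """Probability that one burning cell ignites one unburned neighbour this tick."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * tier_scale)
    # Clamp to [0, 1]: an assumption, since strong wind could push the product above 1.
    return min(max(p, 0.0), 1.0)


# Illustrative inputs: dry, unsuppressed cell on the hard tier.
p_hard = ignition_probability(0.5, 1.0, 1.0, 1.0, 0.0, 0.0, TIER_SCALE["hard"])
```

Because `moisture` and `suppression` enter as `(1 − x)` terms, either one reaching 1.0 drives the probability to zero regardless of wind or slope.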

	---

	## Results

	> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).

	| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
	|---|---|---|---|
	| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
	| Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
	| Trained Qwen-2.5-7B (ours) | +5.13 ± 3.90 | +5.74 ± 3.07 | +2.14 ± 2.87 |
	| Δ vs. Heuristic | −2.41 | −0.58 ✓ | −2.59 |

	The medium-tier result is within ±1.0 of the heuristic, which is the official passing criterion.

	Auxiliary metrics for the trained agent:

	| Metric | Easy | Medium | Hard |
	|---|---|---|---|
	| JSON success rate | 98.5% | 99.8% | 99.2% |
	| Mean population saved % | 87% | 97% | 92% |

	Curriculum progression: easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.
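
Auto-promotion of this kind can be sketched with a rolling-mean rule. Everything specific below is hypothetical: the class name, the `promote_at` threshold, and the `window` size are placeholders, and demotion is left out for brevity. The actual logic lives in `env/curriculum.py`.

```python
from collections import deque

TIERS = ("easy", "medium", "hard")


class CurriculumSketch:
    """Promote to the next tier once the rolling mean reward clears a threshold."""

    def __init__(self, promote_at: float = 5.0, window: int = 10):
        self.promote_at = promote_at      # hypothetical threshold
        self.window = window              # hypothetical number of recent steps to average
        self.tier_index = 0
        self.recent = deque(maxlen=window)

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    def record(self, mean_reward: float) -> str:
        """Log one training step's mean reward and return the tier for the next step."""
        self.recent.append(mean_reward)
        window_full = len(self.recent) == self.window
        can_promote = self.tier_index < len(TIERS) - 1
        if window_full and can_promote and sum(self.recent) / self.window >= self.promote_at:
            self.tier_index += 1
            self.recent.clear()           # start a fresh window on the new tier
        return self.tier
```

Clearing the window after each promotion keeps rewards earned on an easier tier from counting toward the next promotion.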

	> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).

	---

	## Training

	We use a two-stage recipe:

	1. SFT warm-up — generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
	2. GRPO (TRL `GRPOTrainer`) — start from the SFT adapter, score completions by resetting the env to the exact `(tier, seed)` that produced each prompt, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
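
The format reward from step 2 can be sketched as below. TRL's `GRPOTrainer` calls custom reward functions with keyword arguments such as `prompts` and `completions` and expects one float per completion; the body here is an assumption about what "JSON validity" means (a JSON object with an `action_type` key scores 1.0, anything else 0.0), not the notebook's actual implementation.

```python
import json


def reward_fn_format(completions, **kwargs) -> list[float]:
    """Score each completion 1.0 if it is a JSON object with an action_type, else 0.0."""
    rewards = []
    for completion in completions:
        # Standard format passes strings; conversational format passes message lists.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        try:
            parsed = json.loads(text)
            valid = isinstance(parsed, dict) and "action_type" in parsed
        except (ValueError, TypeError):
            valid = False
        rewards.append(1.0 if valid else 0.0)
    return rewards
```

Keeping format and outcome as separate reward functions lets the training dashboard show JSON validity independently of episode reward.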

	Hardware: A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
	Training stack: `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.

	Training plots: W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git — generate with `python scripts/plot_grpo_training.py`).

	For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).

	---

	## Project Structure

	```text
	Wildfire-Containment-Simulator/
	├── env/
	│   ├── wildfire_env.py         # Main env: reset(), step(), state()
	│   ├── models.py               # Pydantic action/observation/state models
	│   ├── grid.py                 # Terrain, smoke, moisture, fog-of-war
	│   ├── fire_spread.py          # Cellular automaton fire propagation
	│   ├── weather.py              # Stochastic weather engine
	│   ├── resources.py            # Crews, tankers, firebreaks, recon
	│   ├── reward.py               # Decomposed step + terminal reward
	│   ├── briefing.py             # OperationalBriefing generation
	│   ├── serialization.py        # Observation → LLM prompt
	│   ├── action_parser.py        # LLM output → Action (3-layer fallback)
	│   ├── rendering.py            # Frame rendering for GIF replays
	│   └── curriculum.py           # CurriculumController (auto-promote/demote)
	├── agents/
	│   ├── random_agent.py
	│   └── heuristic_agent.py
	├── graders/
	│   ├── grader_easy.py          # → (total_reward, details_dict)
	│   ├── grader_medium.py
	│   └── grader_hard.py
	├── scripts/
	│   ├── evaluate.py             # Baseline eval (random + heuristic)
	│   ├── eval_compare.py         # Multi-agent comparison
	│   ├── eval_trained_model.py   # Evaluate a trained adapter
	│   ├── generate_sft_data.py    # Build SFT dataset from heuristic rollouts
	│   ├── replay.py               # Render episode as GIF
	│   ├── run_demo.py             # Pitch demo
	│   └── plot_dashboard.py       # 4-panel training curves
	├── training/
	│   ├── grpo_v2_colab.ipynb     # GRPO notebook (canonical)
	│   ├── sft_colab.ipynb         # SFT warm-up notebook
	│   ├── sft_data.jsonl          # 4,300 SFT examples
	│   ├── requirements.txt        # Training deps (Unsloth, TRL, etc.)
	│   └── README.md
	├── server/
	│   └── app.py                  # FastAPI on port 7860
	├── frontend/                   # Interactive HTML/JS frontend served at /ui/
	├── tests/                      # 41 pytest tests
	├── demos/                      # GIF/PNG demo assets
	├── openenv.yaml                # OpenEnv environment manifest
	├── Dockerfile                  # HF Space build
	├── BLOG.md                     # Long-form write-up
	└── README.md                   # You are here
	```

	---

	## Architecture Decisions

	1. Decomposed reward for GRPO. Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range, which produces meaningful advantages among the rollouts in each GRPO group.
	2. Operational briefings as first-class instructions. The briefing isn't cosmetic — protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
	3. Two-stage training (SFT → GRPO). SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
	4. 3-layer action parser. JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
	5. Per-step (tier, seed) replay in the reward function. Each GRPO completion is scored by replaying the exact env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
	6. Deterministic seeding. `np.random.default_rng(seed)` is threaded through every subsystem — every run is byte-for-byte reproducible.
	7. OpenEnv compliance over framework lock-in. The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.

	---

	## Citation

	If you use this environment, please cite:

	```bibtex
	@misc{wildfire-containment-simulator-2026,
	  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
	            Instruction Following for Disaster-Response LLM Agents},
	  author = {Team Wildfire},
	  year   = {2026},
	  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
	  note   = {Meta OpenEnv Hackathon submission, Theme 2}
	}
	```

	---

	## License

	[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.