---
title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
  - reinforcement-learning
  - simulation
  - openenv
  - wildfire
  - rl-environment
  - long-horizon
  - instruction-following
---

# Wildfire Containment Simulator

**Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**

![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
![Python](https://img.shields.io/badge/Python-3.11+-blue)
![License](https://img.shields.io/badge/License-MIT-green)

A partially observable disaster simulation in which an LLM acts as **Incident Commander**: it interprets operational briefings, dispatches ground crews and air tankers, and recovers from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.

> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on the Medium tier, vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining a **99%+ JSON success rate** throughout.
> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b).
W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*

---

## 🔗 Quick Links

| Resource | Link |
|---|---|
| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
| 📈 **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |

---

## Why Theme 2

| Pillar | How we model it |
|---|---|
| **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |

---

## Real-World Motivation

Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation.
We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.

For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).

---

## Quickstart

```bash
# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend
```

Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).

---

## Live Hugging Face Space

The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:

```bash
SPACE=https://eshit-wildfire-containment-simulator.hf.space
curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
  -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
```

Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
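
The curl calls above map directly onto any HTTP client. Below is a minimal Python sketch using only the standard library. It assumes the endpoints return JSON bodies (the response schema is not spelled out here), and the helper names `build_reset_request`, `build_step_request`, `send`, and `demo` are illustrative and not part of the repository.

```python
import json
from urllib import parse, request

SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(base_url: str, task_id: str, seed: int) -> request.Request:
    """Build the POST /reset call; task_id and seed travel in the query string."""
    query = parse.urlencode({"task_id": task_id, "seed": seed})
    return request.Request(f"{base_url}/reset?{query}", data=b"", method="POST")


def build_step_request(base_url: str, action: dict) -> request.Request:
    """Build the POST /step call with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the reply (assumption: the server answers with JSON)."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo() -> None:
    """Reset the easy tier, then deploy one crew. Performs real network calls."""
    send(build_reset_request(SPACE, "easy", 42))
    action = {
        "action_type": "deploy_crew",
        "crew_id": "crew_0",
        "target_row": 7,
        "target_col": 7,
    }
    print(send(build_step_request(SPACE, action)))
```

Calling `demo()` hits the live Space; swap `SPACE` for `http://localhost:7860` to target a local server started with `python server/app.py`.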
---

## Environment API

```python
from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)  # Observation (with OperationalBriefing on first step)

while not env.done:
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7,
        target_col=7,
    )
    result = env.step(action)  # StepResult
    obs = result.observation
    reward = result.reward     # decomposed float, range ~−8 to +8
    done = result.done

state = env.state()            # Full ground truth (grading only)
```

`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders; agents must work from `Observation`.

---

## Action Space

All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**

| Action | Required parameters | Description |
|---|---|---|
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
| `idle` | `reason` *(optional)* | Explicitly wait |

A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback.
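
The three layers can be pictured with a short sketch. This is not the code in `env/action_parser.py`: it returns plain dicts instead of the Pydantic `Action`, and the regex patterns, the `parse_action` name, and the fallback `reason` string are all assumptions made for illustration.

```python
import json
import re
from typing import Any

# Hypothetical field patterns; the real parser may recognize other fields.
_FIELD_PATTERNS = {
    "action_type": re.compile(r'"?action_type"?\s*[:=]\s*"?([a-z_]+)"?'),
    "crew_id": re.compile(r'"?crew_id"?\s*[:=]\s*"?(crew_\d+)"?'),
    "target_row": re.compile(r'"?target_row"?\s*[:=]\s*(\d+)'),
    "target_col": re.compile(r'"?target_col"?\s*[:=]\s*(\d+)'),
}


def parse_action(raw: str) -> dict[str, Any]:
    """Map raw model output to an action dict without ever raising."""
    # Layer 1: the output is already a valid JSON object with an action_type.
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict) and "action_type" in parsed:
            return parsed
    except json.JSONDecodeError:
        pass

    # Layer 2: pull individual fields out of chatty or truncated output.
    fields: dict[str, Any] = {}
    for name, pattern in _FIELD_PATTERNS.items():
        match = pattern.search(raw)
        if match:
            value = match.group(1)
            fields[name] = int(value) if value.isdigit() else value
    if "action_type" in fields:
        return fields

    # Layer 3: nothing usable was found, so fall back to a harmless idle.
    return {"action_type": "idle", "reason": "unparseable model output"}
```

Under these assumptions, a truncated completion such as `{"action_type": "move_crew", "crew_id": "crew_2", "direc` still yields a `move_crew` dict from layer 2, and free-form prose falls through to `idle`.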
**The environment loop never breaks on bad model output.**

---

## Observation Space

| Component | Contents | Noise / occlusion |
|---|---|---|
| `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
| `recent_events` | Last 5 notable events | Fully observable |

The observation is rendered into LLM-friendly text via `serialize_observation()` (`env/serialization.py`), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.

---

## Reward Function

Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.

**Per-step (dense):**

```
step_reward = 0.4 · Δcontainment
            + 0.4 · Δpopulation_safety
            − 0.1 · redundant_action_flag
```

**Terminal (sparse, on episode end):**

```
+5.0    if all populations safe
+0–2.0  efficiency bonus (faster containment ⇒ more)
+1.0    briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0    if any crew casualty
−0.01 × invalid_action_count    capped at −0.2
```

Total empirical range: **−8 to +8**, declared in `openenv.yaml`.

| Tier | Spread scale | Episode length | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |

---

## Three Difficulty Tiers

### Task 1 (Easy): Flatland Grass Fire

15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps.
**Focus:** basic deployment and perimeter control.

### Task 2 (Medium): Canyon Terrain with Wind Shifts

25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps.
**Focus:** terrain-aware containment under multi-front pressure.

### Task 3 (Hard): Wildland-Urban Interface Crisis

40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps.
**Focus:** long-horizon planning under uncertainty and recovery from cascading failures.

---

## Fire Spread Model

A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:

```
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale
```

| Factor | Effect |
|---|---|
| `base_rate` | Baseline spread by fuel type |
| `fuel_factor` | Fuel load of the target cell |
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
| `slope_factor` | Faster uphill, slower downhill |
| `moisture` | Wet ground / recent rain reduces ignition probability |
| `suppression` | Crew presence and retardant coverage reduce spread |
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |

Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
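
The ignition formula above reads naturally as a plain function. The sketch below assumes every factor has already been computed for one (burning cell, neighbor) pair. `SpreadFactors` and `ignition_probability` are hypothetical names rather than the API of `env/fire_spread.py`, and clamping the product to [0, 1] is our assumption, since the handling of products above 1 is not described here.

```python
from dataclasses import dataclass

# Tier scales as listed in the factor table.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


@dataclass(frozen=True)
class SpreadFactors:
    """Per-neighbor inputs to the ignition formula (hypothetical container)."""
    base_rate: float
    fuel_factor: float
    wind_factor: float
    slope_factor: float
    moisture: float     # expected in [0, 1]
    suppression: float  # expected in [0, 1]


def ignition_probability(f: SpreadFactors, tier: str) -> float:
    """Multiply the factors exactly as in P(ignite), then clamp to [0, 1]."""
    p = (
        f.base_rate
        * f.fuel_factor
        * f.wind_factor
        * f.slope_factor
        * (1.0 - f.moisture)
        * (1.0 - f.suppression)
        * TIER_SCALE[tier]
    )
    # Assumption: clamp so that strong wind and slope boosts cannot exceed 1.
    return min(1.0, max(0.0, p))
```

With illustrative values `base_rate=0.3`, `fuel_factor=1.0`, `wind_factor=1.5`, `slope_factor=1.2`, `moisture=0.2`, and `suppression=0.5`, the product is 0.216 on easy and 0.216 × 0.55 ≈ 0.119 on hard, which shows how `tier_scale` slows spread on the larger maps.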
---

## Results

> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).

| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |

The Medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.

**Auxiliary metrics for the trained agent:**

| Metric | Easy | Medium | Hard |
|---|---|---|---|
| JSON success rate | 98.5% | 99.8% | 99.2% |
| Mean population saved % | 87% | 97% | 92% |

**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached the hard tier in just 63 of 150 training steps.

> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).

---

## Training

We use a two-stage recipe:

1. **SFT warm-up:** generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
2. **GRPO (TRL `GRPOTrainer`):** start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity).
Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).

**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.

**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.

**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate with `python scripts/plot_grpo_training.py`).

For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).

---

## Project Structure

```text
Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py        # Main env: reset(), step(), state()
│   ├── models.py              # Pydantic action/observation/state models
│   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py         # Cellular automaton fire propagation
│   ├── weather.py             # Stochastic weather engine
│   ├── resources.py           # Crews, tankers, firebreaks, recon
│   ├── reward.py              # Decomposed step + terminal reward
│   ├── briefing.py            # OperationalBriefing generation
│   ├── serialization.py       # Observation → LLM prompt
│   ├── action_parser.py       # LLM output → Action (3-layer fallback)
│   ├── rendering.py           # Frame rendering for GIF replays
│   └── curriculum.py          # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py         # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py            # Baseline eval (random + heuristic)
│   ├── eval_compare.py        # Multi-agent comparison
│   ├── eval_trained_model.py  # Evaluate a trained adapter
│   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
│   ├── replay.py              # Render episode as GIF
│   ├── run_demo.py            # Pitch demo
│   └── plot_dashboard.py      # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
│   ├── sft_colab.ipynb        # SFT warm-up notebook
│   ├── sft_data.jsonl         # 4,300 SFT examples
│   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                 # FastAPI on port 7860
├── frontend/                  # Interactive HTML/JS frontend served at /ui/
├── tests/                     # 41 pytest tests
├── demos/                     # GIF/PNG demo assets
├── openenv.yaml               # OpenEnv environment manifest
├── Dockerfile                 # HF Space build
├── BLOG.md                    # Long-form write-up
└── README.md                  # You are here
```

---

## Architecture Decisions

1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env.
This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.

---

## Citation

If you use this environment, please cite:

```bibtex
@misc{wildfire-containment-simulator-2026,
  title  = {Wildfire Containment Simulator: Long-Horizon Planning and Instruction Following for Disaster-Response LLM Agents},
  author = {Team Wildfire},
  year   = {2026},
  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
  note   = {Meta OpenEnv Hackathon submission, Theme 2}
}
```

---

## License

[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.