| ---
|
| title: Wildfire Containment Simulator
|
| emoji: 🔥
|
| colorFrom: red
|
| colorTo: purple
|
| sdk: docker
|
| pinned: false
|
| license: mit
|
| tags:
|
| - reinforcement-learning
|
| - simulation
|
| - openenv
|
| - wildfire
|
| - rl-environment
|
| - long-horizon
|
| - instruction-following
|
| ---
|
|
|
| # Wildfire Containment Simulator
|
|
|
| **Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**
|
|
|
| 
|
| 
|
| 
|
| 
|
| 
|
|
|
| A partially observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
|
|
|
| > **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on the Medium tier, versus **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining a **99%+ JSON success rate** throughout.
|
| > *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*
|
|
|
| ---
|
|
|
| ## Quick Links
|
|
|
| | Resource | Link |
|
| |---|---|
|
| | **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
|
| | 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
|
| | **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
|
| | **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
|
| | **Long-form blog post** | [`BLOG.md`](BLOG.md) |
|
| | **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
|
| | **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
|
| | 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
|
| | 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
|
|
|
| ---
|
|
|
| ## Why Theme 2
|
|
|
| | Pillar | How we model it |
|
| |---|---|
|
| | **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
|
| | **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
|
| | **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
|
|
|
| ---
|
|
|
| ## Real-World Motivation
|
|
|
| Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so that an LLM can be trained, evaluated, and inspected on it end-to-end.
|
|
|
| For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
|
|
|
| ---
|
|
|
| ## Quickstart
|
|
|
| ```bash
|
| # Clone and install
|
| git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
|
| cd Wildfire-Containment-Simulator
|
| uv pip install -r requirements.txt
|
| uv pip install -e .
|
|
|
| # Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
|
| python scripts/evaluate.py 5
|
|
|
| # Compare agents head-to-head
|
| python scripts/eval_compare.py --seeds 42 43 44 45 46 \
|
| --tiers easy medium hard --agents random heuristic
|
|
|
| # Render an episode as a GIF
|
| python scripts/replay.py --tier medium --seed 42 \
|
| --agent heuristic --output demos/replay.gif
|
|
|
| # Spin up the OpenEnv FastAPI server locally on port 7860
|
| python server/app.py
|
| # Then visit http://localhost:7860/ui/ for the interactive frontend
|
| ```
|
|
|
| Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
|
|
|
| ---
|
|
|
| ## Live Hugging Face Space
|
|
|
| The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
|
|
|
| ```bash
|
| SPACE=https://eshit-wildfire-containment-simulator.hf.space
|
|
|
| curl "$SPACE/health"
|
| curl -X POST "$SPACE/reset?task_id=easy&seed=42"
|
| curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
|
| -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
|
| ```
|
|
|
| Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
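The same calls can be issued from Python using only the standard library. The following is a minimal sketch, not part of the repository: the helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical, and it assumes the endpoints return JSON bodies, which the curl examples above do not confirm.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base_url: str = SPACE) -> urllib.request.Request:
    """Build (but do not send) a POST /reset request with query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", data=b"", method="POST")


def build_step_request(action: dict, base_url: str = SPACE) -> urllib.request.Request:
    """Build (but do not send) a POST /step request carrying a JSON action."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A driver loop would call `send(build_reset_request("easy", 42))` once and then `send(build_step_request(action))` per step. Keeping request construction separate from sending makes the payloads easy to inspect without network access.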
|
|
|
| ---
|
|
|
| ## Environment API
|
|
|
| ```python
|
| from env import WildfireEnv, Action, ActionType, Direction
|
|
|
| env = WildfireEnv()
|
| obs = env.reset(task_id="easy", seed=42) # Observation (with OperationalBriefing on first step)
|
|
|
| while not env.done:
|
|     # Placeholder policy: a real agent would choose an action from obs
|
|     action = Action(
|
|         action_type=ActionType.DEPLOY_CREW,
|
|         crew_id="crew_0",
|
|         target_row=7, target_col=7,
|
|     )
|
|     result = env.step(action)  # StepResult
|
|     obs = result.observation
|
|     reward = result.reward  # decomposed float, range ~-8 to +8
|
|     done = result.done
|
|
|
| state = env.state() # Full ground truth (grading only)
|
| ```
|
|
|
| `reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders β agents must work from `Observation`.
|
|
|
| ---
|
|
|
| ## Action Space
|
|
|
| All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
|
|
|
| | Action | Required parameters | Description |
|
| |---|---|---|
|
| | `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
|
| | `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
|
| | `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
|
| | `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
|
| | `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
|
| | `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
|
| | `idle` | `reason` *(optional)* | Explicitly wait |
|
|
|
| A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
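The three layers can be illustrated with a short standalone sketch. This is not the code in `env/action_parser.py`: the function name `parse_action`, the regex, and the plain-dict return type (instead of the Pydantic `Action`) are assumptions made for illustration.

```python
import json
import re

# Action names taken from the Action Space table above.
VALID_ACTION_TYPES = {
    "deploy_crew", "move_crew", "order_crew_objective",
    "drop_retardant", "build_firebreak", "recon_flight", "idle",
}

# Matches "key": "string" or "key": integer pairs inside otherwise broken text.
_FIELD_PATTERN = re.compile(r'"(\w+)"\s*:\s*(?:"([^"]*)"|(-?\d+))')


def _has_valid_type(fields: dict) -> bool:
    action_type = fields.get("action_type")
    return isinstance(action_type, str) and action_type in VALID_ACTION_TYPES


def parse_action(raw: str) -> dict:
    """Map raw model output to an action dict without ever raising."""
    # Layer 1: the output is already valid JSON.
    try:
        candidate = json.loads(raw)
        if isinstance(candidate, dict) and _has_valid_type(candidate):
            return candidate
    except json.JSONDecodeError:
        pass

    # Layer 2: pull key/value pairs out of malformed or chatty output.
    fields = {}
    for key, text_value, int_value in _FIELD_PATTERN.findall(raw):
        fields[key] = int(int_value) if int_value else text_value
    if _has_valid_type(fields):
        return fields

    # Layer 3: nothing usable was found, so fall back to a safe idle.
    return {"action_type": "idle", "reason": "unparseable model output"}
```

Under these assumptions, truncated output such as `Sure! {"action_type": "move_crew", "crew_id": "crew_1", "direction": "NE"` still yields a usable `move_crew` action through the regex layer, while free-form text falls through to `idle`.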
|
|
|
| ---
|
|
|
| ## Observation Space
|
|
|
| | Component | Contents | Noise / occlusion |
|
| |---|---|---|
|
| | `briefing` | `OperationalBriefing` on the first observation: incident ID, priority zones, infrastructure, wind forecast | First step only |
|
| | `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
|
| | `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
|
| | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
|
| | `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
|
| | `recent_events` | Last 5 notable events | Fully observable |
|
|
|
| The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
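To make the clustering idea concrete, here is a minimal sketch of BFS over burning cells that emits one bounding box per connected region. It is not the code in `env/serialization.py`: the function name `cluster_fire_regions`, the boolean-grid input, and the use of 8-connectivity are assumptions.

```python
from collections import deque


def cluster_fire_regions(burning):
    """Group burning cells into connected regions.

    burning: 2D list of truthy/falsy values (truthy means the cell is on fire).
    Returns a list of (min_row, min_col, max_row, max_col, cell_count) tuples,
    one per region, in row-major discovery order.
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []

    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a new BFS from this unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                # Visit the 8 surrounding cells (assumed connectivity).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            regions.append((min_r, min_c, max_r, max_c, count))
    return regions
```

A prompt builder can then emit one line per tuple, so a 40×40 grid with two fire fronts costs two lines of text rather than 1,600 cell descriptions.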
|
|
|
| ---
|
|
|
| ## Reward Function
|
|
|
| Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.
|
|
|
| **Per-step (dense):**
|
| ```
|
| step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
|
| ```
|
|
|
| **Terminal (sparse, on episode end):**
|
| ```
|
| +5.0                            if all populations safe
|
| +0–2.0                          efficiency bonus (faster containment → more)
|
| +1.0                            briefing-adherence bonus (all priority zones survived)
|
| −3.0 · (pop_lost / total_pop)   if any population lost
|
| −2.0                            if any crew casualty
|
| −0.01 × invalid_action_count    capped at −0.2
|
| ```
|
|
|
| Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
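The two formulas above can be sketched as plain functions. This is an illustration, not the code in `env/reward.py`: the function names are hypothetical, the efficiency bonus is assumed to scale linearly with an `efficiency` score in [0, 1] (the source only gives its 0 to 2.0 range), and the bonuses and penalties are assumed to be independent of each other.

```python
def step_reward(delta_containment: float,
                delta_population_safety: float,
                redundant_action: bool) -> float:
    """Dense per-step reward from the formula above."""
    penalty = 0.1 if redundant_action else 0.0
    return 0.4 * delta_containment + 0.4 * delta_population_safety - penalty


def terminal_reward(pop_lost: int,
                    total_pop: int,
                    efficiency: float,
                    priority_zones_survived: bool,
                    crew_casualty: bool,
                    invalid_action_count: int) -> float:
    """Sparse end-of-episode reward from the terms listed above."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                              # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)     # scaled population loss
    # Assumption: linear efficiency bonus, clamped to the documented 0-2.0 range.
    reward += 2.0 * min(max(efficiency, 0.0), 1.0)
    if priority_zones_survived:
        reward += 1.0                              # briefing adherence
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # penalty capped at 0.2
    return reward
```

With these assumptions, a perfect episode (no losses, full efficiency, all priority zones intact, no invalid actions) earns a terminal reward of 8.0, which matches the upper end of the declared range.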
|
|
|
| | Tier | Spread scale | Episode length | Approx. reward ceiling |
|
| |---|---|---|---|
|
| | Easy | 1.00× | 80 | +8 |
|
| | Medium | 0.70× | 150 | +7 |
|
| | Hard | 0.55× | 300 | +6 |
|
|
|
| ---
|
|
|
| ## Three Difficulty Tiers
|
|
|
| ### Task 1 (Easy): Flatland Grass Fire
|
| 15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
|
|
|
| ### Task 2 (Medium): Canyon Terrain with Wind Shifts
|
| 25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
|
|
|
| ### Task 3 (Hard): Wildland-Urban Interface Crisis
|
| 40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
|
|
|
| ---
|
|
|
| ## Fire Spread Model
|
|
|
| A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
|
|
|
| ```
|
| P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
|
|             × (1 − moisture) × (1 − suppression) × tier_scale
|
| ```
|
|
|
| | Factor | Effect |
|
| |---|---|
|
| | `base_rate` | Baseline spread by fuel type |
|
| | `fuel_factor` | Fuel load of the target cell |
|
| | `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
|
| | `slope_factor` | Faster uphill, slower downhill |
|
| | `moisture` | Wet ground / recent rain reduces ignition probability |
|
| | `suppression` | Crew presence and retardant coverage reduce spread |
|
| | `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
|
|
|
| Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
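The ignition formula above translates directly into code. The sketch below is illustrative and is not the code in `env/fire_spread.py`: the function names are hypothetical, and clamping the product to [0, 1] is an added assumption so that a large `wind_factor` cannot produce an invalid probability.

```python
import random

# Tier scales taken from the factor table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(base_rate: float,
                         fuel_factor: float,
                         wind_factor: float,
                         slope_factor: float,
                         moisture: float,
                         suppression: float,
                         tier: str) -> float:
    """Probability that one burning cell ignites one unburned neighbor."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    # Assumption: clamp so the result is always a valid probability.
    return min(max(p, 0.0), 1.0)


def attempt_ignition(probability: float, rng: random.Random) -> bool:
    """Sample a single ignition attempt from a seeded generator."""
    return rng.random() < probability
```

For example, a medium-tier cell with `base_rate=0.3`, `wind_factor=1.5`, `moisture=0.2`, and neutral fuel, slope, and suppression has an ignition probability of about 0.252 per attempt. Passing the generator explicitly keeps sampling reproducible for a fixed seed.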
|
|
|
| ---
|
|
|
| ## Results
|
|
|
| > Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
|
|
|
| | Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|
| |---|---|---|---|
|
| | Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
|
| | Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
|
| | **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
|
| | **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |
|
|
|
| The Medium-tier result lies within ±1.0 of the heuristic, which meets the official passing criterion.
|
|
|
| **Auxiliary metrics for the trained agent:**
|
|
|
| | Metric | Easy | Medium | Hard |
|
| |---|---|---|---|
|
| | JSON success rate | 98.5% | 99.8% | 99.2% |
|
| | Mean population saved % | 87% | 97% | 92% |
|
|
|
| **Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.
|
|
|
| > Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).
|
|
|
| ---
|
|
|
| ## Training
|
|
|
| We use a two-stage recipe:
|
|
|
| 1. **SFT warm-up.** Generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
|
| 2. **GRPO (TRL `GRPOTrainer`).** Start from the SFT adapter, then score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
|
|
|
| **Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
|
| **Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.
|
|
|
| **Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate it with `python scripts/plot_grpo_training.py`).
|
|
|
| For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
|
|
|
| ---
|
|
|
| ## Project Structure
|
|
|
| ```text
|
| Wildfire-Containment-Simulator/
|
| ├── env/
|
| │   ├── wildfire_env.py        # Main env: reset(), step(), state()
|
| │   ├── models.py              # Pydantic action/observation/state models
|
| │   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
|
| │   ├── fire_spread.py         # Cellular automaton fire propagation
|
| │   ├── weather.py             # Stochastic weather engine
|
| │   ├── resources.py           # Crews, tankers, firebreaks, recon
|
| │   ├── reward.py              # Decomposed step + terminal reward
|
| │   ├── briefing.py            # OperationalBriefing generation
|
| │   ├── serialization.py       # Observation → LLM prompt
|
| │   ├── action_parser.py       # LLM output → Action (3-layer fallback)
|
| │   ├── rendering.py           # Frame rendering for GIF replays
|
| │   └── curriculum.py          # CurriculumController (auto-promote/demote)
|
| ├── agents/
|
| │   ├── random_agent.py
|
| │   └── heuristic_agent.py
|
| ├── graders/
|
| │   ├── grader_easy.py         # → (total_reward, details_dict)
|
| │   ├── grader_medium.py
|
| │   └── grader_hard.py
|
| ├── scripts/
|
| │   ├── evaluate.py            # Baseline eval (random + heuristic)
|
| │   ├── eval_compare.py        # Multi-agent comparison
|
| │   ├── eval_trained_model.py  # Evaluate a trained adapter
|
| │   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
|
| │   ├── replay.py              # Render episode as GIF
|
| │   ├── run_demo.py            # Pitch demo
|
| │   └── plot_dashboard.py      # 4-panel training curves
|
| ├── training/
|
| │   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
|
| │   ├── sft_colab.ipynb        # SFT warm-up notebook
|
| │   ├── sft_data.jsonl         # 4,300 SFT examples
|
| │   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
|
| │   └── README.md
|
| ├── server/
|
| │   └── app.py                 # FastAPI on port 7860
|
| ├── frontend/                  # Interactive HTML/JS frontend served at /ui/
|
| ├── tests/                     # 41 pytest tests
|
| ├── demos/                     # GIF/PNG demo assets
|
| ├── openenv.yaml               # OpenEnv environment manifest
|
| ├── Dockerfile                 # HF Space build
|
| ├── BLOG.md                    # Long-form write-up
|
| └── README.md                  # You are here
|
| ```
|
|
|
| ---
|
|
|
| ## Architecture Decisions
|
|
|
| 1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
|
| 2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
|
| 3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
|
| 4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
|
| 5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md), "What broke").
|
| 6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
|
| 7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
|
|
|
| ---
|
|
|
| ## Citation
|
|
|
| If you use this environment, please cite:
|
|
|
| ```bibtex
|
| @misc{wildfire-containment-simulator-2026,
|
| title = {Wildfire Containment Simulator: Long-Horizon Planning and
|
| Instruction Following for Disaster-Response LLM Agents},
|
| author = {Team Wildfire},
|
| year = {2026},
|
| url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
|
| note = {Meta OpenEnv Hackathon submission, Theme 2}
|
| }
|
| ```
|
|
|
| ---
|
|
|
| ## License
|
|
|
| [MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.
|
|
|