---
title: Wildfire Containment Simulator
emoji: 🔥
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
license: mit
tags:
- reinforcement-learning
- simulation
- openenv
- wildfire
- rl-environment
- long-horizon
- instruction-following
---
# Wildfire Containment Simulator
**Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**
![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
![Python](https://img.shields.io/badge/Python-3.11+-blue)
![License](https://img.shields.io/badge/License-MIT-green)
A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on the Medium tier, vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining a **99%+ JSON success rate** throughout.
> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*
---
## 🔗 Quick Links
| Resource | Link |
|---|---|
| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
| 📈 **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
---
## Why Theme 2
| Pillar | How we model it |
|---|---|
| **Long-horizon planning** | The Hard tier runs 300 steps, and the +5.0 terminal reward triggers only on full population survival; greedy local moves cannot capture it. |
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
---
## Real-World Motivation
Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so an LLM can be trained, evaluated, and inspected on it end-to-end.
For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
---
## Quickstart
```bash
# Clone and install
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
cd Wildfire-Containment-Simulator
uv pip install -r requirements.txt
uv pip install -e .

# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
python scripts/evaluate.py 5

# Compare agents head-to-head
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
    --tiers easy medium hard --agents random heuristic

# Render an episode as a GIF
python scripts/replay.py --tier medium --seed 42 \
    --agent heuristic --output demos/replay.gif

# Spin up the OpenEnv FastAPI server locally on port 7860
python server/app.py
# Then visit http://localhost:7860/ui/ for the interactive frontend
```
Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
---
## Live Hugging Face Space
The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
```bash
SPACE=https://eshit-wildfire-containment-simulator.hf.space
curl "$SPACE/health"
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
```
Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
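
The same calls can be made from Python with only the standard library. The sketch below mirrors the curl commands above; the helper names (`build_reset_request`, `build_step_request`, `send`) are illustrative and not part of the repository, and the shape of the JSON responses is not assumed.

```python
import json
import urllib.parse
import urllib.request

SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(base_url: str, task_id: str, seed: int) -> urllib.request.Request:
    """POST /reset with task_id and seed as query parameters, as in the curl call."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", data=b"", method="POST")


def build_step_request(base_url: str, action: dict) -> urllib.request.Request:
    """POST /step with the action serialized as a JSON body."""
    return urllib.request.Request(
        f"{base_url}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs real network calls, so it is left commented out):
# send(build_reset_request(SPACE, "easy", 42))
# send(build_step_request(SPACE, {"action_type": "deploy_crew", "crew_id": "crew_0",
#                                 "target_row": 7, "target_col": 7}))
```

Separating request construction from sending keeps the payloads easy to inspect or log before anything touches the network.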
---
## Environment API
```python
from env import WildfireEnv, Action, ActionType, Direction

env = WildfireEnv()
obs = env.reset(task_id="easy", seed=42)  # Observation (with OperationalBriefing on first step)

while not env.done:
    # Placeholder policy: a real agent would choose each action from `obs`.
    action = Action(
        action_type=ActionType.DEPLOY_CREW,
        crew_id="crew_0",
        target_row=7, target_col=7,
    )
    result = env.step(action)  # StepResult
    obs = result.observation
    reward = result.reward     # decomposed float, range ~−8 to +8
    done = result.done

state = env.state()  # Full ground truth (grading only)
```
`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders; agents must work from `Observation`.
---
## Action Space
All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
| Action | Required parameters | Description |
|---|---|---|
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
| `idle` | `reason` *(optional)* | Explicitly wait |
A 3-layer parser (`env/action_parser.py`) maps raw LLM output to a structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
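
To make the fallback chain concrete, here is a minimal sketch of a three-layer parser that returns plain dicts. It is not the code in `env/action_parser.py`: the function names, the regex patterns, and the `reason` string are assumptions chosen for illustration, and the real parser presumably returns a validated `Action` model instead.

```python
import json
import re

# Hypothetical idle fallback; the real parser's fallback payload may differ.
IDLE = {"action_type": "idle", "reason": "unparseable model output"}
# Field names come from the action table above; the patterns are illustrative.
_STRING_FIELDS = ("action_type", "crew_id", "tanker_id", "direction", "objective")
_INT_FIELDS = ("target_row", "target_col")


def _try_json(text: str):
    """Layer 1: the whole output is a valid JSON object with an action_type."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, dict) and "action_type" in parsed:
        return parsed
    return None


def _try_regex(text: str):
    """Layer 2: pull individual fields out of malformed or chatty output."""
    fields = {}
    for name in _STRING_FIELDS:
        match = re.search(rf'"?{name}"?\s*[:=]\s*"?([A-Za-z_0-9]+)"?', text)
        if match:
            fields[name] = match.group(1)
    for name in _INT_FIELDS:
        match = re.search(rf'"?{name}"?\s*[:=]\s*"?(-?\d+)"?', text)
        if match:
            fields[name] = int(match.group(1))
    return fields if "action_type" in fields else None


def parse_action(text: str) -> dict:
    """Layer 3: fall back to a safe idle action when nothing can be recovered."""
    return _try_json(text) or _try_regex(text) or dict(IDLE)
```

Because the last layer always returns something, a caller can pass any model output straight into the environment step without a try/except around parsing.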
---
## Observation Space
| Component | Contents | Noise / occlusion |
|---|---|---|
| `briefing` | `OperationalBriefing` on first obs: incident ID, priority zones, infrastructure, wind forecast | First step only |
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
| `recent_events` | Last 5 notable events | Fully observable |
The observation is rendered into LLM-friendly text via `serialize_observation()` (`env/serialization.py`), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
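
The clustering idea can be sketched as follows. This is an illustration rather than the repository's implementation: `cluster_fire_regions` is a hypothetical name, the input is simplified to a boolean grid, and 8-connectivity is an assumption (chosen to match the Moore neighborhood used by the fire-spread model).

```python
from collections import deque


def cluster_fire_regions(burning: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Group burning cells into 8-connected regions.

    Returns one bounding box (min_row, min_col, max_row, max_col) per region,
    in row-major order of each region's first cell.
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Breadth-first search from this unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

A prompt built from these boxes grows with the number of separate fire fronts, not with grid size, which matters most on the 40×40 Hard grid.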
---
## Reward Function
Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.
**Per-step (dense):**
```
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
```
**Terminal (sparse, on episode end):**
```
+5.0                            if all populations safe
+0–2.0                          efficiency bonus (faster containment ⇒ more)
+1.0                            briefing-adherence bonus (all priority zones survived)
−3.0 · (pop_lost / total_pop)   if any population lost
−2.0                            if any crew casualty
−0.01 × invalid_action_count    capped at −0.2
```
Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
| Tier | Spread scale | Episode length | Approx. reward ceiling |
|---|---|---|---|
| Easy | 1.00× | 80 | +8 |
| Medium | 0.70× | 150 | +7 |
| Hard | 0.55× | 300 | +6 |
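
As a worked reading of the per-step and terminal formulas above, the sketch below computes both. It is not the code in `env/reward.py`: the function names and argument names are illustrative, and because the efficiency-bonus formula is not specified here, the bonus is passed in as a precomputed value and clamped to the stated 0 to 2.0 range.

```python
def step_reward(delta_containment: float, delta_population_safety: float,
                redundant_action: bool) -> float:
    """Dense per-step term: weighted deltas minus a redundancy penalty."""
    penalty = 0.1 if redundant_action else 0.0
    return 0.4 * delta_containment + 0.4 * delta_population_safety - penalty


def terminal_reward(pop_lost: int, total_pop: int, efficiency_bonus: float,
                    priority_zones_survived: bool, crew_casualty: bool,
                    invalid_action_count: int) -> float:
    """Sparse terminal term, applied once at episode end."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                             # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)    # proportional loss penalty
    reward += min(max(efficiency_bonus, 0.0), 2.0)  # assumed precomputed, clamped to [0, 2]
    if priority_zones_survived:
        reward += 1.0                             # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # penalty capped at 0.2
    return reward
```

Under these assumptions the best-case terminal sum is 5 + 2 + 1 = 8, and the same episode with a crew casualty yields 6, which appears to line up with the Easy and Hard ceilings in the table.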
---
## Three Difficulty Tiers
### Task 1 (Easy): Flatland Grass Fire
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
### Task 2 (Medium): Canyon Terrain with Wind Shifts
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
### Task 3 (Hard): Wildland-Urban Interface Crisis
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
---
## Fire Spread Model
A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
```
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
            × (1 − moisture) × (1 − suppression) × tier_scale
```
| Factor | Effect |
|---|---|
| `base_rate` | Baseline spread by fuel type |
| `fuel_factor` | Fuel load of the target cell |
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
| `slope_factor` | Faster uphill, slower downhill |
| `moisture` | Wet ground / recent rain reduces ignition probability |
| `suppression` | Crew presence and retardant coverage reduce spread |
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
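
The ignition formula above translates directly into code. The sketch below is an illustration, not the contents of `env/fire_spread.py`: the function names are hypothetical, each factor is taken as an already-computed input (how wind and slope factors are derived is not covered here), the clamp to [0, 1] is an added safeguard, and the standard-library `random.Random` stands in for the NumPy generator the project uses.

```python
import random

# Tier multipliers from the factor table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(base_rate: float, fuel_factor: float, wind_factor: float,
                         slope_factor: float, moisture: float, suppression: float,
                         tier: str) -> float:
    """Product of the spread factors, clamped to a valid probability."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(max(p, 0.0), 1.0)


def try_ignite(p: float, rng: random.Random) -> bool:
    """One Bernoulli ignition attempt for a single burning-to-unburned neighbor pair."""
    return rng.random() < p
```

For example, a base rate of 0.2 with a 1.5× wind boost and no moisture or suppression gives 0.30 on Easy but 0.165 on Hard, since `tier_scale` multiplies the whole product.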
---
## Results
> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|---|---|---|---|
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |
The Medium-tier result falls within ±1.0 of the heuristic, which is the official passing criterion.
**Auxiliary metrics for the trained agent:**
| Metric | Easy | Medium | Hard |
|---|---|---|---|
| JSON success rate | 98.5% | 99.8% | 99.2% |
| Mean population saved % | 87% | 97% | 92% |
**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached the hard tier in just 63 of 150 training steps.
> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).
---
## Training
We use a two-stage recipe:
1. **SFT warm-up:** generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
2. **GRPO (TRL `GRPOTrainer`):** start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
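
The outcome-scoring step in stage 2 can be sketched as a small rollout function. This is a simplified illustration under several assumptions, not the notebook's `reward_fn_outcome`: `score_completion`, `ToyEnv`, and `ToyResult` are hypothetical names, the episode reward is taken to be the sum of step rewards, and the prompt is assumed to correspond to the state right after `reset` (a mid-episode prompt would also need the earlier actions replayed). The toy environment exists only so the sketch runs without the real `WildfireEnv`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyResult:
    observation: int
    reward: float
    done: bool


class ToyEnv:
    """Stand-in env: 5-step episode, reward 1.0 for the action "good", else 0.0."""

    def __init__(self) -> None:
        self._t = 0

    def reset(self, task_id: str, seed: int) -> int:
        self._t = 0
        return self._t

    def step(self, action: Any) -> ToyResult:
        self._t += 1
        reward = 1.0 if action == "good" else 0.0
        return ToyResult(self._t, reward, self._t >= 5)


def score_completion(make_env: Callable[[], Any], tier: str, seed: int,
                     candidate_action: Any, heuristic: Callable[[Any], Any],
                     max_steps: int = 300) -> float:
    """Reset to (tier, seed), apply the candidate, then let the heuristic finish."""
    env = make_env()
    env.reset(task_id=tier, seed=seed)
    result = env.step(candidate_action)   # the action proposed by the model
    total = result.reward
    steps = 1
    while not result.done and steps < max_steps:
        result = env.step(heuristic(result.observation))
        total += result.reward
        steps += 1
    return total
```

With the toy environment, `score_completion(ToyEnv, "easy", 42, "good", lambda obs: "good")` returns 5.0 and the same call with a `"bad"` candidate returns 4.0, so the difference between two completions in a group isolates the effect of the single candidate action.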
**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.
**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate with `python scripts/plot_grpo_training.py`).
For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
---
## Project Structure
```text
Wildfire-Containment-Simulator/
├── env/
│   ├── wildfire_env.py        # Main env: reset(), step(), state()
│   ├── models.py              # Pydantic action/observation/state models
│   ├── grid.py                # Terrain, smoke, moisture, fog-of-war
│   ├── fire_spread.py         # Cellular automaton fire propagation
│   ├── weather.py             # Stochastic weather engine
│   ├── resources.py           # Crews, tankers, firebreaks, recon
│   ├── reward.py              # Decomposed step + terminal reward
│   ├── briefing.py            # OperationalBriefing generation
│   ├── serialization.py       # Observation → LLM prompt
│   ├── action_parser.py       # LLM output → Action (3-layer fallback)
│   ├── rendering.py           # Frame rendering for GIF replays
│   └── curriculum.py          # CurriculumController (auto-promote/demote)
├── agents/
│   ├── random_agent.py
│   └── heuristic_agent.py
├── graders/
│   ├── grader_easy.py         # → (total_reward, details_dict)
│   ├── grader_medium.py
│   └── grader_hard.py
├── scripts/
│   ├── evaluate.py            # Baseline eval (random + heuristic)
│   ├── eval_compare.py        # Multi-agent comparison
│   ├── eval_trained_model.py  # Evaluate a trained adapter
│   ├── generate_sft_data.py   # Build SFT dataset from heuristic rollouts
│   ├── replay.py              # Render episode as GIF
│   ├── run_demo.py            # Pitch demo
│   └── plot_dashboard.py      # 4-panel training curves
├── training/
│   ├── grpo_v2_colab.ipynb    # GRPO notebook (canonical)
│   ├── sft_colab.ipynb        # SFT warm-up notebook
│   ├── sft_data.jsonl         # 4,300 SFT examples
│   ├── requirements.txt       # Training deps (Unsloth, TRL, etc.)
│   └── README.md
├── server/
│   └── app.py                 # FastAPI on port 7860
├── frontend/                  # Interactive HTML/JS frontend served at /ui/
├── tests/                     # 41 pytest tests
├── demos/                     # GIF/PNG demo assets
├── openenv.yaml               # OpenEnv environment manifest
├── Dockerfile                 # HF Space build
├── BLOG.md                    # Long-form write-up
└── README.md                  # You are here
```
---
## Architecture Decisions
1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
---
## Citation
If you use this environment, please cite:
```bibtex
@misc{wildfire-containment-simulator-2026,
title = {Wildfire Containment Simulator: Long-Horizon Planning and
Instruction Following for Disaster-Response LLM Agents},
author = {Team Wildfire},
year = {2026},
url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
note = {Meta OpenEnv Hackathon submission, Theme 2}
}
```
---
## License
[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.