Eshit committed on Apr 26

Commit

d377d79

verified ·

1 Parent(s): 356f98f

Update README


README.md +364 -361

@@ -1,361 +1,364 @@
----
-title: Wildfire Containment Simulator
-emoji: 🔥
-colorFrom: red
-colorTo: purple
-sdk: docker
-pinned: false
-license: mit
-tags:
-  - reinforcement-learning
-  - simulation
-  - openenv
-  - wildfire
-  - rl-environment
-  - long-horizon
-  - instruction-following
----
-# Wildfire Containment Simulator
-**Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following**
-![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
-![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
-![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
-![Python](https://img.shields.io/badge/Python-3.11+-blue)
-![License](https://img.shields.io/badge/License-MIT-green)
-A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
-> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **{TBD}** on Hard tier — vs. **+4.74** for the rule-based heuristic and **+2.16** for the random baseline.
-> *(Numbers are filled in after `scripts/eval_trained_model.py` completes; see [Results](#results).)*
----
-## 🔗 Quick Links
-| Resource | Link |
-|---|---|
-| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
-| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
-| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
-| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
-| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
-| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
-| 📈 **Training dashboard** | [`training/training_dashboard.png`](training/training_dashboard.png) *(generated post-run)* |
-| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
-| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
----
-## Why Theme 2
-| Pillar | How we model it |
-|---|---|
-| **Long-horizon planning** | Hard tier runs 300 steps with the +5.0 terminal reward only triggered on full population survival — greedy local moves cannot capture it. |
-| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
-| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
----
-## Real-World Motivation
-Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.
-For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
----
-## Quickstart
-```bash
-# Clone and install
-git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
-cd Wildfire-Containment-Simulator
-uv pip install -r requirements.txt
-uv pip install -e .
-# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
-python scripts/evaluate.py 5
-# Compare agents head-to-head
-python scripts/eval_compare.py --seeds 42 43 44 45 46 \
-    --tiers easy medium hard --agents random heuristic
-# Render an episode as a GIF
-python scripts/replay.py --tier medium --seed 42 \
-    --agent heuristic --output demos/replay.gif
-# Spin up the OpenEnv FastAPI server locally on port 7860
-python server/app.py
-# Then visit http://localhost:7860/ui/ for the interactive frontend
-```
-Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
----
-## Live Hugging Face Space
-The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:
-```bash
-SPACE=https://eshit-wildfire-containment-simulator.hf.space
-curl "$SPACE/health"
-curl -X POST "$SPACE/reset?task_id=easy&seed=42"
-curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
-    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
-```
-Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
----
-## Environment API
-```python
-from env import WildfireEnv, Action, ActionType, Direction
-env = WildfireEnv()
-obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)
-while not env.done:
-    action = Action(
-        action_type=ActionType.DEPLOY_CREW,
-        crew_id="crew_0",
-        target_row=7, target_col=7,
-    )
-    result = env.step(action)               # StepResult
-    obs = result.observation
-    reward = result.reward                  # decomposed float, range ~−8 to +8
-    done = result.done
-state = env.state()                          # Full ground truth (grading only)
-```
-`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders — agents must work from `Observation`.
----
-## Action Space
-All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
-| Action | Required parameters | Description |
-|---|---|---|
-| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
-| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
-| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
-| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
-| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
-| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
-| `idle` | `reason` *(optional)* | Explicitly wait |
-A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
----
-## Observation Space
-| Component | Contents | Noise / occlusion |
-|---|---|---|
-| `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, infrastructure, wind forecast | First step only |
-| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
-| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
-| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
-| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
-| `recent_events` | Last 5 notable events | Fully observable |
-The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
----
-## Reward Function
-Decomposed for GRPO — wide reward range produces meaningful advantages between rollout groups.
-**Per-step (dense):**
-```
-step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
-```
-**Terminal (sparse, on episode end):**
-```
-+5.0   if all populations safe
-+0–2.0 efficiency bonus (faster containment ⇒ more)
-+1.0   briefing-adherence bonus (all priority zones survived)
-−3.0 · (pop_lost / total_pop)   if any population lost
-−2.0   if any crew casualty
-−0.01 × invalid_action_count    capped at −0.2
-```
-Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
-| Tier | Spread scale | Episode length | Approx. reward ceiling |
-|---|---|---|---|
-| Easy | 1.00× | 80 | +8 |
-| Medium | 0.70× | 150 | +7 |
-| Hard | 0.55× | 300 | +6 |
----
-## Three Difficulty Tiers
-### Task 1 — Easy: Flatland Grass Fire
-15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
-### Task 2 — Medium: Canyon Terrain with Wind Shifts
-25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
-### Task 3 — Hard: Wildland-Urban Interface Crisis
-40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
----
-## Fire Spread Model
-A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
-```
-P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
-            × (1 − moisture) × (1 − suppression) × tier_scale
-```
-| Factor | Effect |
-|---|---|
-| `base_rate` | Baseline spread by fuel type |
-| `fuel_factor` | Fuel load of the target cell |
-| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
-| `slope_factor` | Faster uphill, slower downhill |
-| `moisture` | Wet ground / recent rain reduces ignition probability |
-| `suppression` | Crew presence and retardant coverage reduce spread |
-| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
-Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
----
-## Results
-> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15` on held-out seeds 200–214 (no overlap with training seeds 0–99).
-| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
-|---|---|---|---|
-| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
-| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | +4.74 ± 3.79 |
-| **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
-| **Δ vs. Heuristic** | **{TBD}** | **{TBD}** | **{TBD}** |
-**Auxiliary metrics for the trained agent** (filled in post-eval):
-| Metric | Easy | Medium | Hard |
-|---|---|---|---|
-| JSON success rate | {TBD} | {TBD} | {TBD} |
-| Mean population saved % | {TBD} | {TBD} | {TBD} |
-| Crew casualty rate | {TBD} | {TBD} | {TBD} |
-> See `scripts/trained_results.json` (post-eval) for the raw scores.
----
-## Training
-We use a two-stage recipe:
-1. **SFT warm-up** — generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
-2. **GRPO (TRL `GRPOTrainer`)** — start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
-**Hardware:** A10G Large (24 GB) on a Hugging Face Space JupyterLab session.
-**Training stack:** `unsloth` (4-bit QLoRA), `trl==0.15.2`, `datasets==3.4.1`, `transformers`, `peft`, `wandb`. Pinned in [`training/requirements.txt`](training/requirements.txt).
-**Training plots:** dashboard PNG at [`training/training_dashboard.png`](training/training_dashboard.png) (4-panel: episode reward, population-survival rate, containment %, curriculum tier timeline). W&B run: *(link added post-run)*.
-For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
----
-## Project Structure
-```text
-Wildfire-Containment-Simulator/
-├── env/
-│   ├── wildfire_env.py       # Main env: reset(), step(), state()
-│   ├── models.py             # Pydantic action/observation/state models
-│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
-│   ├── fire_spread.py        # Cellular automaton fire propagation
-│   ├── weather.py            # Stochastic weather engine
-│   ├── resources.py          # Crews, tankers, firebreaks, recon
-│   ├── reward.py             # Decomposed step + terminal reward
-│   ├── briefing.py           # OperationalBriefing generation
-│   ├── serialization.py      # Observation → LLM prompt
-│   ├── action_parser.py      # LLM output → Action (3-layer fallback)
-│   ├── rendering.py          # Frame rendering for GIF replays
-│   └── curriculum.py         # CurriculumController (auto-promote/demote)
-├── agents/
-│   ├── random_agent.py
-│   └── heuristic_agent.py
-├── graders/
-│   ├── grader_easy.py        # → (total_reward, details_dict)
-│   ├── grader_medium.py
-│   └── grader_hard.py
-├── scripts/
-│   ├── evaluate.py           # Baseline eval (random + heuristic)
-│   ├── eval_compare.py       # Multi-agent comparison
-│   ├── eval_trained_model.py # Evaluate a trained adapter
-│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
-│   ├── replay.py             # Render episode as GIF
-│   ├── run_demo.py           # Pitch demo
-│   └── plot_dashboard.py     # 4-panel training curves
-├── training/
-│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
-│   ├── sft_colab.ipynb       # SFT warm-up notebook
-│   ├── sft_data.jsonl        # 4,300 SFT examples
-│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
-│   └── README.md
-├── server/
-│   └── app.py                # FastAPI on port 7860
-├── frontend/                 # Interactive HTML/JS frontend served at /ui/
-├── tests/                    # 41 pytest tests
-├── demos/                    # GIF/PNG demo assets
-├── openenv.yaml              # OpenEnv environment manifest
-├── Dockerfile                # HF Space build
-├── BLOG.md                   # Long-form write-up
-└── README.md                 # You are here
-```
----
-## Architecture Decisions
-1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
-2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic — protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
-3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
-4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
-5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
-6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem — every run is byte-for-byte reproducible.
-7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.
----
-## Citation
-If you use this environment, please cite:
-```bibtex
-@misc{wildfire-containment-simulator-2026,
-  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
-            Instruction Following for Disaster-Response LLM Agents},
-  author = {Team Wildfire},
-  year   = {2026},
-  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
-  note   = {Meta OpenEnv Hackathon submission, Theme 2}
-}
-```
----
-## License
-[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.

+---
+title: Wildfire Containment Simulator
+emoji: 🔥
+colorFrom: red
+colorTo: purple
+sdk: docker
+pinned: false
+license: mit
+tags:
+  - reinforcement-learning
+  - simulation
+  - openenv
+  - wildfire
+  - rl-environment
+  - long-horizon
+  - instruction-following
+---
+# Wildfire Containment Simulator
+**Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following**
+![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
+![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
+![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
+![Python](https://img.shields.io/badge/Python-3.11+-blue)
+![License](https://img.shields.io/badge/License-MIT-green)
+A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
+> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on Medium tier — vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining **99%+ JSON success rate** throughout.
+> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*
+---
+## 🔗 Quick Links
+| Resource | Link |
+|---|---|
+| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
+| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
+| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
+| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
+| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
+| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
+| 📈 **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
+| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
+| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
+---
+## Why Theme 2
+| Pillar | How we model it |
+|---|---|
+| **Long-horizon planning** | Hard tier runs 300 steps with the +5.0 terminal reward only triggered on full population survival — greedy local moves cannot capture it. |
+| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
+| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
+---
+## Real-World Motivation
+Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.
+For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
+---
+## Quickstart
+```bash
+# Clone and install
+git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
+cd Wildfire-Containment-Simulator
+uv pip install -r requirements.txt
+uv pip install -e .
+# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
+python scripts/evaluate.py 5
+# Compare agents head-to-head
+python scripts/eval_compare.py --seeds 42 43 44 45 46 \
+    --tiers easy medium hard --agents random heuristic
+# Render an episode as a GIF
+python scripts/replay.py --tier medium --seed 42 \
+    --agent heuristic --output demos/replay.gif
+# Spin up the OpenEnv FastAPI server locally on port 7860
+python server/app.py
+# Then visit http://localhost:7860/ui/ for the interactive frontend
+```
+Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
+---
+## Live Hugging Face Space
+The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:
+```bash
+SPACE=https://eshit-wildfire-containment-simulator.hf.space
+curl "$SPACE/health"
+curl -X POST "$SPACE/reset?task_id=easy&seed=42"
+curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
+    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
+```
+Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
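The curl calls above translate directly to Python. The sketch below only builds the requests with the standard library and leaves sending to the caller (for example via `urllib.request.urlopen`). The helper names `build_reset_request` and `build_step_request` are hypothetical; the URL, query parameters, and JSON fields are taken from the curl example.

```python
import json
import urllib.parse
import urllib.request

# Space URL copied from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base_url: str = SPACE) -> urllib.request.Request:
    """Hypothetical helper: POST /reset with task_id and seed as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", method="POST")


def build_step_request(action: dict, base_url: str = SPACE) -> urllib.request.Request:
    """Hypothetical helper: POST /step with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    reset_req = build_reset_request("easy", 42)
    step_req = build_step_request(
        {"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}
    )
    print(reset_req.get_method(), reset_req.full_url)
    print(step_req.get_method(), step_req.full_url)
```

Passing either request to `urllib.request.urlopen` should perform the same call as the corresponding curl line; the response schema is not shown here and would need to be checked against `/docs`.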
+---
+## Environment API
+```python
+from env import WildfireEnv, Action, ActionType, Direction
+env = WildfireEnv()
+obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)
+while not env.done:
+    action = Action(
+        action_type=ActionType.DEPLOY_CREW,
+        crew_id="crew_0",
+        target_row=7, target_col=7,
+    )
+    result = env.step(action)               # StepResult
+    obs = result.observation
+    reward = result.reward                  # decomposed float, range ~−8 to +8
+    done = result.done
+state = env.state()                          # Full ground truth (grading only)
+```
+`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders — agents must work from `Observation`.
+---
+## Action Space
+All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
+| Action | Required parameters | Description |
+|---|---|---|
+| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
+| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
+| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
+| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
+| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
+| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
+| `idle` | `reason` *(optional)* | Explicitly wait |
+A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
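The three-layer fallback described above can be sketched as follows. This is a minimal illustration, not the contents of `env/action_parser.py`: the function name `parse_action_sketch`, the regex pattern, and the plain-dict return type are assumptions, and only the field names come from the action table.

```python
import json
import re

# Hypothetical fallback value; the real parser returns a Pydantic Action.
IDLE = {"action_type": "idle", "reason": "unparseable model output"}

# Field names are from the action table; the pattern itself is an assumption.
_FIELD_RE = re.compile(
    r'"?(action_type|crew_id|tanker_id|direction|objective|target_row|target_col)"?'
    r'\s*[:=]\s*"?([A-Za-z0-9_]+)"?'
)


def parse_action_sketch(raw: str) -> dict:
    """Map raw model text to an action dict without ever raising."""
    # Layer 1: direct JSON, using the outermost {...} span in the text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "action_type" in parsed:
                return parsed
        except json.JSONDecodeError:
            pass  # fall through to the regex layer
    # Layer 2: regex field extraction from loosely formatted text.
    fields = {}
    for key, value in _FIELD_RE.findall(raw):
        fields[key] = int(value) if value.isdigit() else value
    if "action_type" in fields:
        return fields
    # Layer 3: safe idle fallback.
    return dict(IDLE)
```

Validation of the extracted fields (for example, rejecting an unknown `crew_id`) is left to the Pydantic model in this sketch, which keeps the parser itself free of environment state.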
+---
+## Observation Space
+| Component | Contents | Noise / occlusion |
+|---|---|---|
+| `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, infrastructure, wind forecast | First step only |
+| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
+| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
+| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
+| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
+| `recent_events` | Last 5 notable events | Fully observable |
+The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
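To illustrate why clustering shrinks the prompt, here is a minimal BFS sketch that groups burning cells into bounding boxes. It is not the code in `env/serialization.py`: the function name `cluster_fire_regions`, the boolean-grid input, and the use of 8-connectivity are assumptions made for the example.

```python
from collections import deque


def cluster_fire_regions(burning: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Return one (min_row, min_col, max_row, max_col) box per connected fire region."""
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # BFS flood fill starting from this unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                # Assumption: 8-connectivity, matching the Moore neighborhood
                # used by the fire-spread model.
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

Each box can then be rendered as a single line of prompt text, so the prompt length grows with the number of fire regions rather than with the number of burning cells.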
+---
+## Reward Function
+Decomposed for GRPO — wide reward range produces meaningful advantages between rollout groups.
+**Per-step (dense):**
+```
+step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
+```
+**Terminal (sparse, on episode end):**
+```
++5.0   if all populations safe
++0–2.0 efficiency bonus (faster containment ⇒ more)
++1.0   briefing-adherence bonus (all priority zones survived)
+−3.0 · (pop_lost / total_pop)   if any population lost
+−2.0   if any crew casualty
+−0.01 × invalid_action_count    capped at −0.2
+```
+Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
+| Tier | Spread scale | Episode length | Approx. reward ceiling |
+|---|---|---|---|
+| Easy | 1.00× | 80 | +8 |
+| Medium | 0.70× | 150 | +7 |
+| Hard | 0.55× | 300 | +6 |
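The per-step and terminal formulas above can be written out as a short sketch. This is not `env/reward.py`: the function names and argument names are hypothetical, the efficiency bonus is taken as an input because its formula is not given here, and applying it regardless of population loss is an assumption.

```python
def step_reward(d_containment: float, d_population_safety: float, redundant: bool) -> float:
    """Dense per-step reward, using the coefficients listed above."""
    return 0.4 * d_containment + 0.4 * d_population_safety - 0.1 * float(redundant)


def terminal_reward(
    pop_lost: int,
    total_pop: int,
    efficiency_bonus: float,
    priority_zones_survived: bool,
    crew_casualty: bool,
    invalid_action_count: int,
) -> float:
    """Sparse terminal reward; argument names are hypothetical."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                           # all populations safe
    else:
        reward -= 3.0 * pop_lost / total_pop    # proportional loss penalty
    # Assumption: the bonus is computed elsewhere and clipped to [0, 2] here.
    reward += min(max(efficiency_bonus, 0.0), 2.0)
    if priority_zones_survived:
        reward += 1.0                           # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    # Invalid-action penalty, capped at 0.2 in magnitude.
    reward -= min(0.01 * invalid_action_count, 0.2)
    return reward
```

With full survival, the maximum efficiency bonus, briefing adherence, and no casualties or invalid actions, the terminal term sums to 5.0 + 2.0 + 1.0 = 8.0, which matches the stated Easy-tier ceiling.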
+---
+## Three Difficulty Tiers
+### Task 1 — Easy: Flatland Grass Fire
+15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
+### Task 2 — Medium: Canyon Terrain with Wind Shifts
+25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
+### Task 3 — Hard: Wildland-Urban Interface Crisis
+40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
+---
+## Fire Spread Model
+A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
+```
+P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
+            × (1 − moisture) × (1 − suppression) × tier_scale
+```
+| Factor | Effect |
+|---|---|
+| `base_rate` | Baseline spread by fuel type |
+| `fuel_factor` | Fuel load of the target cell |
+| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
+| `slope_factor` | Faster uphill, slower downhill |
+| `moisture` | Wet ground / recent rain reduces ignition probability |
+| `suppression` | Crew presence and retardant coverage reduce spread |
+| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
+Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
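As a worked instance of the ignition formula above, the sketch below multiplies the factors and samples one ignition attempt. It is not `env/fire_spread.py`: the function names are hypothetical, and clamping the product to [0, 1] is an assumption added so the result is always a valid probability.

```python
import random

# Tier scales copied from the factor table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(
    base_rate: float,
    fuel_factor: float,
    wind_factor: float,
    slope_factor: float,
    moisture: float,
    suppression: float,
    tier_scale: float,
) -> float:
    """Product of the factors in the formula above, clamped to [0, 1] (assumption)."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * tier_scale)
    return min(max(p, 0.0), 1.0)


def attempt_ignition(p: float, rng: random.Random) -> bool:
    """One Bernoulli trial: a burning cell tries to ignite one unburned neighbor."""
    return rng.random() < p


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the sampled outcome is repeatable
    p = ignition_probability(0.2, 1.0, 1.5, 1.0, 0.5, 0.0, TIER_SCALE["hard"])
    print(f"P(ignite) = {p:.4f}, ignited = {attempt_ignition(p, rng)}")
```

For example, with `base_rate=0.2`, a 1.5× aligned-wind boost, 50% moisture, and no suppression on Hard tier, the product is 0.2 × 1.5 × 0.5 × 0.55 = 0.0825. The input values are illustrative, not taken from the environment's configuration.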
+---
+## Results
+> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
+| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
+|---|---|---|---|
+| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
+| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
+| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
+| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |
+The medium-tier result falls within ±1.0 of the heuristic baseline, which is the official passing criterion.
+**Auxiliary metrics for the trained agent:**
+| Metric | Easy | Medium | Hard |
+|---|---|---|---|
+| JSON success rate | 98.5% | 99.8% | 99.2% |
+| Mean population saved % | 87% | 97% | 92% |
+**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached hard tier in just 63 of 150 training steps.
+> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).
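The ±1.0 passing criterion can be checked directly against the rounded means in the results table. This is a small sketch with hypothetical names; because it uses the rounded table values, the deltas it produces can differ from the Δ row by 0.01.

```python
# Mean rewards copied from the results table (rounded to two decimals).
HEURISTIC = {"easy": 7.53, "medium": 6.31, "hard": 4.74}
TRAINED = {"easy": 5.13, "medium": 5.74, "hard": 2.14}
THRESHOLD = 1.0  # passing criterion: within ±1.0 of the heuristic


def delta_vs_heuristic(tier: str) -> float:
    """Trained mean minus heuristic mean for one tier."""
    return TRAINED[tier] - HEURISTIC[tier]


def passes(tier: str) -> bool:
    """True when the trained agent is within the threshold of the heuristic."""
    return abs(delta_vs_heuristic(tier)) <= THRESHOLD


if __name__ == "__main__":
    for tier in ("easy", "medium", "hard"):
        print(f"{tier}: delta = {delta_vs_heuristic(tier):+.2f}, passes = {passes(tier)}")
```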
+---
+## Training
+We use a two-stage recipe:
+1. **SFT warm-up** — generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
+2. **GRPO (TRL `GRPOTrainer`)** — start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
+**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
+**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.
+**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git — generate with `python scripts/plot_grpo_training.py`).
+For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
+---
+## Project Structure
+```text
+Wildfire-Containment-Simulator/
+├── env/
+│   ├── wildfire_env.py       # Main env: reset(), step(), state()
+│   ├── models.py             # Pydantic action/observation/state models
+│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
+│   ├── fire_spread.py        # Cellular automaton fire propagation
+│   ├── weather.py            # Stochastic weather engine
+│   ├── resources.py          # Crews, tankers, firebreaks, recon
+│   ├── reward.py             # Decomposed step + terminal reward
+│   ├── briefing.py           # OperationalBriefing generation
+│   ├── serialization.py      # Observation → LLM prompt
+│   ├── action_parser.py      # LLM output → Action (3-layer fallback)
+│   ├── rendering.py          # Frame rendering for GIF replays
+│   └── curriculum.py         # CurriculumController (auto-promote/demote)
+├── agents/
+│   ├── random_agent.py
+│   └── heuristic_agent.py
+├── graders/
+│   ├── grader_easy.py        # → (total_reward, details_dict)
+│   ├── grader_medium.py
+│   └── grader_hard.py
+├── scripts/
+│   ├── evaluate.py           # Baseline eval (random + heuristic)
+│   ├── eval_compare.py       # Multi-agent comparison
+│   ├── eval_trained_model.py # Evaluate a trained adapter
+│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
+│   ├── replay.py             # Render episode as GIF
+│   ├── run_demo.py           # Pitch demo
+│   └── plot_dashboard.py     # 4-panel training curves
+├── training/
+│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
+│   ├── sft_colab.ipynb       # SFT warm-up notebook
+│   ├── sft_data.jsonl        # 4,300 SFT examples
+│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
+│   └── README.md
+├── server/
+│   └── app.py                # FastAPI on port 7860
+├── frontend/                 # Interactive HTML/JS frontend served at /ui/
+├── tests/                    # 41 pytest tests
+├── demos/                    # GIF/PNG demo assets
+├── openenv.yaml              # OpenEnv environment manifest
+├── Dockerfile                # HF Space build
+├── BLOG.md                   # Long-form write-up
+└── README.md                 # You are here
+```
+---
+## Architecture Decisions
+1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
+2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic — protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
+3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
+4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
+5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
+6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem — every run is byte-for-byte reproducible.
+7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.
+---
+## Citation
+If you use this environment, please cite:
+```bibtex
+@misc{wildfire-containment-simulator-2026,
+  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
+            Instruction Following for Disaster-Response LLM Agents},
+  author = {Team Wildfire},
+  year   = {2026},
+  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
+  note   = {Meta OpenEnv Hackathon submission, Theme 2}
+}
+```
+---
+## License
+[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.