Spaces: Eshit / Wildfire-Containment-Simulator

Eshit committed 29 days ago

Commit

5d6ff6f

1 Parent(s): 66a57c6

docs: split README and BLOG; remove private guides from tracking

- README: judge-facing overview, links, API, results placeholders

- BLOG: long-form training narrative and v1 post-mortem

- Untrack AGENTS.md, CLAUDE.md, prompts.md, Meta OpenEnv PDF; expand .gitignore

Made-with: Cursor

Files changed (7)

.gitignore +9 -0
AGENTS.md +0 -24
BLOG.md +246 -0
CLAUDE.md +0 -69
README.md +200 -262
[External] Meta OpenEnv Hackathon Participant Help Guide.pdf +0 -3
prompts.md +0 -644

.gitignore CHANGED

@@ -72,5 +72,14 @@ desktop.ini
 # Private project notes (not for judges / public)
 HACKATHON_ALIGNMENT.md
 colab_prompts.md
+prompts.md
+implementation_plan.md
+Summary.txt
+yt.txt
+CLAUDE.md
+AGENTS.md
+# Meta / hackathon reference PDFs (keep out of Hub for size + IP)
+*OpenEnv*Hackathon*Participant*Help*Guide*.pdf
+*External*Meta*OpenEnv*Hackathon*.pdf
 private_notes/
 .private/

AGENTS.md DELETED

@@ -1,24 +0,0 @@
-# Repository Guidelines
-## Project Structure & Module Organization
-Core simulation code lives in `env/`, including fire spread, weather, rewards, rendering, serialization, and the main `WildfireEnv`. Baseline agents are in `agents/`, difficulty graders in `graders/`, and HTTP serving code in `server/` with the entrypoint at `server/app.py`. Utility scripts such as evaluation, replay, and demo generation live in `scripts/`. Tests are centralized in `tests/`, training material is under `training/`, and generated media belongs in `demos/`.
-## Build, Test, and Development Commands
-Install dependencies with `uv pip install -r requirements.txt` and `uv pip install -e .`. Run the test suite with `pytest tests -v` or include coverage via `pytest tests -v --cov=env`. Start the local API with `python app.py` or `python -m server.app`; both serve FastAPI on port `7860`. Common workflows:
-- `python scripts/evaluate.py 5` runs baseline evaluation across tiers.
-- `python scripts/eval_compare.py --seeds 42 43 44 --tiers medium hard --agents random heuristic` compares agents.
-- `python scripts/run_demo.py` generates the demo GIF.
-- `python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/replay.gif` replays one episode.
-## Coding Style & Naming Conventions
-Follow existing Python style: 4-space indentation, `snake_case` for functions/modules, `PascalCase` for Pydantic models and classes, and descriptive enum names such as `ActionType.DEPLOY_CREW`. Keep validation close to models in `env/models.py` and environment execution logic in `env/wildfire_env.py`. No formatter config is checked in, so preserve the surrounding style and keep imports straightforward.
-## Testing Guidelines
-Use `pytest`; test discovery is configured in `pyproject.toml` to read from `tests/`. Name files `test_<feature>.py` and add focused cases near related coverage, for example parser changes in `tests/test_action_parser.py`. For new actions or tiers, add both behavioral tests and at least one regression test for invalid or edge-case inputs.
-## Commit & Pull Request Guidelines
-This workspace does not include `.git`, so repository history is not available for direct inspection. Use short, imperative commit subjects such as `Add hard-tier recon regression tests`. In pull requests, include a concise summary, list affected modules, note test commands run, and attach screenshots or GIFs when changing rendering, replay, or demo output.
-## Configuration & Contribution Notes
-Update `openenv.yaml` when adding tasks, and keep grader/task IDs aligned with `WildfireEnv.TIER_MAP`. When adding a new action, update `env/models.py`, `env/wildfire_env.py`, `env/action_parser.py`, and the corresponding tests together to avoid contract drift.

BLOG.md ADDED

@@ -0,0 +1,246 @@

+# Teaching a 7B Language Model to Fight Wildfires with GRPO
+*A frank write-up of what we built, what worked, and what broke — for the Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following.*
+> **TL;DR.** We built a partially-observable wildfire-response RL environment on OpenEnv, generated 4,300 supervised examples from a hand-coded heuristic, did a 1-epoch SFT warm-up on Qwen-2.5-7B-Instruct, then ran GRPO with a curriculum that auto-promotes the agent across three difficulty tiers. The trained agent reaches **{TBD}** mean reward on Hard tier (heuristic baseline: +4.74; random: +2.16). Code, env, training notebooks, and a live HF Space are all linked from the [`README`](README.md).
+---
+## Why wildfires?
+Most RL environments for language models are puzzles, games, or code tasks. We wanted something with three properties at once:
+1. **Long-horizon, sparse-terminal reward.** A real plan has to survive 100+ steps before the result lands.
+2. **Partial observability that *gets worse* during the episode.** Smoke spreads, recon expires, fog-of-war hides what hasn't been scouted recently.
+3. **An explicit instruction-following channel.** A first-step "operational briefing" the agent must read, internalize, and adhere to — and a reward term that rewards adherence.
+Wildfire incident command hits all three. An incident commander gets a briefing, has hard resource limits (crews, air tankers, firebreak budget, recon), and has to balance speed vs. coverage vs. civilian safety while wind, slope, and humidity all change underneath them. We turned that into a structured grid environment with typed actions, an `OperationalBriefing` on reset, and a decomposed reward — and then trained an LLM to play the role of the IC.
+---
+## The environment, top down
+The environment is OpenEnv-compliant: `reset(task_id, seed) → Observation`, `step(Action) → StepResult`, `state() → dict`. Three difficulty tiers, all runnable on the same code path:
+```
+Easy   →  15×15 flat grid, 1 ignition, constant wind, 80 steps
+Medium →  25×25 canyon terrain, 2 ignitions, wind shifts, smoke,  150 steps
+Hard   →  40×40 wildland-urban interface, staggered ignitions,
+          fog-of-war, mid-episode crew casualty, 300 steps
+```
+The agent never directly applies suppression. It positions resources — crews, tankers, firebreaks, recon flights — and the environment computes the resulting fire dynamics each tick. The 11-step tick pipeline is fully deterministic given a seed:
+```
+validate(action) → execute(action) → spread_fire → apply_suppression
+→ evolve_weather → update_moisture → propagate_smoke → tick_cooldowns
+→ expire_recon → trigger_scripted_events → compute_reward → check_termination
+```
+**Fire spreads via a Rothermel-inspired cellular automaton.** Every burning cell rolls against each of its 8 neighbors:
+```
+P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
+            × (1 − moisture) × (1 − suppression) × tier_scale
+```
+Wind alignment dominates spread direction. Slope speeds uphill spread. Suppression from crew presence and tanker drops is *spatial* — it only affects the cells you've actually committed resources to.
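The ignition rule above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the code in `env/fire_spread.py`: the function name is hypothetical, the arguments mirror the factor names in the formula, and the clamp to `[0, 1]` is an assumption added so that a large wind or slope factor cannot produce an invalid probability.

```python
def ignition_probability(base_rate, fuel_factor, wind_factor, slope_factor,
                         moisture, suppression, tier_scale):
    """P(ignite) for one burning-cell -> neighbor pair (hypothetical helper)."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * tier_scale)
    # Assumption: clamp so strong wind/slope multipliers stay a valid probability.
    return min(1.0, max(0.0, p))


# Full suppression on the target cell drives the ignition chance to zero.
print(ignition_probability(0.2, 1.0, 1.5, 1.0, moisture=0.5,
                           suppression=1.0, tier_scale=1.0))
```

Because every term is multiplicative, any single factor at zero (a saturated cell, a fully suppressed cell) blocks ignition regardless of wind.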
+---
+## Speaking the agent's language
+A 7B chat model can't natively read a 40×40 grid of cell objects. So we built two adapters between the env and the LLM:
+**`serialize_observation()`** — Turns the raw `Observation` into a structured prompt:
+- BFS-clusters fire cells into bounding boxes ("3 BURNING clusters near rows 7–12, cols 3–8") so prompt length is `O(regions)` not `O(cells)`.
+- Lists resource state with cooldown warnings.
+- Surfaces the last 5 notable events.
+- Notes weather noise levels explicitly so the model knows readings are not exact.
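The clustering step in the first bullet might look roughly like the sketch below. It assumes burning cells arrive as a set of `(row, col)` tuples and that clusters are 8-connected; the function name and the return shape are hypothetical, not the actual `serialize_observation()` internals.

```python
from collections import deque


def cluster_bounding_boxes(burning):
    """Group burning (row, col) cells into 8-connected clusters.

    Returns one (min_row, max_row, min_col, max_col) box per cluster, sorted.
    """
    unvisited = set(burning)
    boxes = []
    while unvisited:
        start = unvisited.pop()
        queue = deque([start])
        min_r = max_r = start[0]
        min_c = max_c = start[1]
        while queue:
            r, c = queue.popleft()
            min_r, max_r = min(min_r, r), max(max_r, r)
            min_c, max_c = min(min_c, c), max(max_c, c)
            # Visit all 8 neighbors; the cell itself is already out of `unvisited`.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbor = (r + dr, c + dc)
                    if neighbor in unvisited:
                        unvisited.remove(neighbor)
                        queue.append(neighbor)
        boxes.append((min_r, max_r, min_c, max_c))
    return sorted(boxes)


print(cluster_bounding_boxes({(0, 0), (1, 1), (5, 5), (5, 6)}))
```

Each box becomes one line of prompt text, which is what keeps the prompt length proportional to the number of fire regions rather than the number of cells.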
+**`parse_action()`** — A 3-layer LLM-output → `Action` mapper:
+1. Strip code fences, find a JSON object, parse it directly.
+2. If JSON parsing fails: regex-extract `action_type` and per-action fields.
+3. Final fallback: return a safe `IDLE`. The training loop never breaks on bad model output.
+That parser fallback is also a **defense against reward hacking** — there's no clever output that crashes the env or skips a step. Worst case the model burns a step on `IDLE` and pays the small step penalty.
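The three layers can be illustrated with a standalone sketch. It is a simplification under stated assumptions: the real `parse_action()` also takes the observation and returns a validated `Action`, whereas this hypothetical `parse_action_sketch` returns a plain dict plus the name of the layer that succeeded, and only extracts two integer fields in the regex layer.

```python
import json
import re

# Action names taken from the README's action table.
KNOWN_ACTIONS = {"DEPLOY_CREW", "MOVE_CREW", "DROP_RETARDANT",
                 "BUILD_FIREBREAK", "RECON_FLIGHT", "IDLE"}


def parse_action_sketch(text):
    """Return (action_dict, layer); never raises on malformed model output."""
    # Layer 1: strip code fences, grab the outermost {...}, parse it as JSON.
    cleaned = re.sub(r"`{3}(?:json)?", "", text)
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match:
        try:
            obj = json.loads(match.group(0))
            if isinstance(obj, dict) and obj.get("action_type") in KNOWN_ACTIONS:
                return obj, "json_success"
        except (json.JSONDecodeError, TypeError):
            pass
    # Layer 2: regex-extract the action type and any integer target fields.
    type_match = re.search(r"action_type\W+([A-Z_]+)", cleaned)
    if type_match and type_match.group(1) in KNOWN_ACTIONS:
        action = {"action_type": type_match.group(1)}
        for field in ("target_row", "target_col"):
            field_match = re.search(field + r"\W+(-?\d+)", cleaned)
            if field_match:
                action[field] = int(field_match.group(1))
        return action, "regex_fallback"
    # Layer 3: safe fallback so the training loop keeps running.
    return {"action_type": "IDLE"}, "safe_idle"


print(parse_action_sketch('{"action_type": "DEPLOY_CREW", "target_row": 7'))
```

A truncated JSON object, as in the usage line, fails layer 1 but is still recovered by layer 2; free text with no recognizable action falls through to `IDLE`.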
+---
+## The reward, designed for GRPO
+GRPO computes advantages by comparing rollout rewards within a group of completions for the same prompt. If your reward signal is too narrow (e.g. all rewards in `[0, 1]`), the advantages collapse and the gradient washes out. We deliberately built a wide-range, decomposed reward.
+**Per-step (dense):**
+```
+step_reward = 0.4·Δcontainment + 0.4·Δpop_safety − 0.1·redundant_action
+```
+**Terminal (sparse, on episode end):**
+```
++5.0   if zero population lost
++0–2.0 efficiency bonus (faster containment ⇒ more)
++1.0   briefing-adherence bonus (all priority zones survived)
+−3.0 · (pop_lost / total_pop)   if any population lost
+−2.0   if any crew casualty
+−0.01 × invalid_action_count    capped at −0.2
+```
+Empirical range: **−8 to +8**. That's a 16-point span, enough for clear advantages between rollout groups.
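As a worked illustration, the two reward tables above translate into the sketch below. The function names are hypothetical, and one detail is an assumption: the write-up only says the efficiency bonus is 0 to 2.0 and grows with faster containment, so here it is modeled as `2.0 × efficiency` with `efficiency` supplied by the caller in `[0, 1]`.

```python
def step_reward_sketch(delta_containment, delta_pop_safety, redundant_action):
    """Dense per-step term: 0.4*dContainment + 0.4*dPopSafety - 0.1*redundant."""
    penalty = 0.1 if redundant_action else 0.0
    return 0.4 * delta_containment + 0.4 * delta_pop_safety - penalty


def terminal_reward_sketch(pop_lost, total_pop, efficiency,
                           priority_zones_survived, crew_casualty,
                           invalid_action_count):
    """Sparse terminal term, added once when the episode ends."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0
    else:
        reward -= 3.0 * (pop_lost / total_pop)
    # Assumption: efficiency is a caller-supplied score in [0, 1].
    reward += 2.0 * min(1.0, max(0.0, efficiency))
    if priority_zones_survived:
        reward += 1.0
    if crew_casualty:
        reward -= 2.0
    # Invalid-action penalty is capped at 0.2 in total.
    reward -= min(0.2, 0.01 * invalid_action_count)
    return reward


# Best case: nobody lost, fastest containment, briefing followed.
print(terminal_reward_sketch(0, 100, 1.0, True, False, 0))  # 8.0
```

The best-case terminal value of 5.0 + 2.0 + 1.0 = 8.0 matches the upper end of the empirical range quoted above.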
+**Two reward functions, not one.** For GRPO we register two reward functions with TRL:
+- `reward_fn_outcome` — the full episodic reward described above (computed by *running the full episode*, see "What broke" below).
+- `reward_fn_format` — a tiny standalone JSON-format check (`+0.15` for valid JSON with a recognized `action_type`, `0.0` for valid JSON with an unknown type, `−0.20` for unparseable garbage). This rewards good formatting independently from policy quality.
+This is the "multiple independent reward functions" pattern from the OpenEnv hackathon guide — and it cost us about 30 lines of code.
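A format-only reward of this shape could be sketched as follows. It assumes completions arrive as plain strings and follows the three cases listed for `reward_fn_format`; the name `format_reward_sketch` and the strict `json.loads` check (no code-fence stripping) are simplifications for illustration, not the project's actual function.

```python
import json

# Action names taken from the README's action table.
KNOWN_ACTIONS = {"DEPLOY_CREW", "MOVE_CREW", "DROP_RETARDANT",
                 "BUILD_FIREBREAK", "RECON_FLIGHT", "IDLE"}


def format_reward_sketch(completions, **kwargs):
    """Score each completion on JSON validity alone; no env or obs needed."""
    rewards = []
    for text in completions:
        try:
            obj = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            rewards.append(-0.20)  # unparseable output
            continue
        action_type = obj.get("action_type") if isinstance(obj, dict) else None
        recognized = isinstance(action_type, str) and action_type in KNOWN_ACTIONS
        rewards.append(0.15 if recognized else 0.0)
    return rewards


print(format_reward_sketch(['{"action_type": "IDLE"}',
                            '{"action_type": "FLY"}',
                            'not json']))
```

The `**kwargs` catch-all is there because the trainer passes extra dataset columns to every reward function, and this one simply ignores them.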
+---
+## Training, in two stages
+### Stage 1 — SFT warm-up (~30 min)
+We harvested 4,300 `(prompt, action_json)` pairs from `HeuristicAgent` rollouts on successful episodes (filtered to `population_lost == 0`):
+| Tier | Examples |
+|---|---|
+| Easy | 2,000 |
+| Medium | 1,500 |
+| Hard | 800 |
+Then 1 epoch of SFT on Qwen-2.5-7B-Instruct via Unsloth 4-bit + LoRA (`r=32`, `α=64`, target modules: `q,k,v,o,gate,up,down`). The aim is **format priming**, not policy quality — we just want the model to reliably emit valid JSON `Action` objects so GRPO has something to optimize against. Going straight from base model to GRPO produced near-zero reward in our early experiments because most completions parsed as `IDLE`.
+### Stage 2 — GRPO with curriculum (~75 min)
+Starting from the SFT adapter, we run TRL's `GRPOTrainer` with 8 generations per prompt, `learning_rate=3e-6`, `max_completion_length=192`. The reward function is the key piece:
+```python
+def reward_fn_outcome(completions, prompts, tier=None, seed=None, **kwargs):
+    rewards = []
+    for i, completion in enumerate(completions):
+        env = WildfireEnv()
+        # CRUCIAL: replay the EXACT (tier, seed) that produced this prompt
+        obs = env.reset(task_id=tier[i], seed=seed[i])
+        action, _ = parse_action(completion, obs)
+        result = env.step(action)
+        total = result.reward
+        # Heuristic carries the episode to completion so terminal reward fires
+        heuristic = HeuristicAgent()
+        while not env.done:
+            result = env.step(heuristic.act(env._current_obs))
+            total += result.reward
+        rewards.append(total)
+    return rewards
+```
+The `CurriculumController` watches a rolling 10-batch reward average and promotes the dataset from easy → medium → hard. A `TrainerCallback` rebuilds the prompt dataset whenever a promotion fires, so prompts and reward states stay synchronized.
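A promotion rule of this kind can be sketched in a few lines. Everything except `get_tier()` and the 10-batch window is an assumption: the class name, the `record()` method, and the promotion thresholds are hypothetical placeholders, since the write-up does not state the actual values.

```python
from collections import deque


class CurriculumSketch:
    """Promote easy -> medium -> hard when the rolling mean reward is high enough."""

    TIERS = ("easy", "medium", "hard")

    def __init__(self, thresholds=(6.0, 5.0), window=10):
        # thresholds[i] is the rolling mean needed to leave TIERS[i] (placeholder values).
        self.thresholds = thresholds
        self.window = deque(maxlen=window)
        self.tier_index = 0

    def get_tier(self):
        return self.TIERS[self.tier_index]

    def record(self, batch_mean_reward):
        """Add one batch reward; return True if this call triggered a promotion."""
        self.window.append(batch_mean_reward)
        at_top = self.tier_index == len(self.TIERS) - 1
        if at_top or len(self.window) < self.window.maxlen:
            return False
        if sum(self.window) / len(self.window) >= self.thresholds[self.tier_index]:
            self.tier_index += 1
            self.window.clear()  # start a fresh average on the new tier
            return True
        return False


controller = CurriculumSketch(thresholds=(1.0, 1.0), window=3)
for reward in (2.0, 2.0, 2.0):
    promoted = controller.record(reward)
print(controller.get_tier(), promoted)
```

Clearing the window on promotion means rewards earned on the easier tier cannot immediately trigger a second promotion.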
+---
+## What broke (and what we fixed)
+We're including this section because we think the bugs are more interesting than the headline numbers.
+### v1 GRPO bug #1 — Frozen dataset, live curriculum
+Our first GRPO run promoted the controller to medium at step 10 and to hard at step 20, but the prompt dataset was built once before `trainer.train()` and *never refreshed*. So from step 10 onward the controller reported a harder tier (medium, then hard) while the model was still being scored on easy-tier prompts. Training stats looked fine; the model wasn't actually learning the harder tasks.
+**Fix:** add a `TrainerCallback.on_step_end` that compares `controller.get_tier()` against the last seen tier and rebuilds the train dataset from scratch when they diverge.
+### v1 GRPO bug #2 — Truncated rollouts never saw terminal reward
+The first reward function ran for a fixed 15 steps, applying the LLM action at step 0 and the heuristic for 14 more steps. But hard tier has `min_active_steps=80`, so the +5.0 terminal reward never fired during training. GRPO advantages were dominated by ±0.5 per-step deltas, not the ±5 terminal spikes the reward was *designed* around.
+**Fix:** in v2, the reward function runs the full episode to `env.done`. This makes training about 2× slower, but the terminal reward now fires during training, so *the gradient signal is comparable to the baseline reward*.
+### v1 GRPO bug #3 — Prompt/reward state mismatch
+The most insidious bug. The dataset's prompts were generated from `(tier, seed=fresh_random)`. The reward function then **picked a different random seed** to roll out against. So the model was being scored in a completely different env state than the one shown in its prompt. Imagine being asked "what would you do here?" while shown a photo of New York, and graded on what would have happened in Tokyo.
+**Fix:** every dataset row stores its `seed`. The reward function reads `seed` from `kwargs` (TRL passes dataset columns through as kwargs) and resets the env to that exact `(tier, seed)`. Prompt state and reward state are now identical.
+### v1 GRPO bug #4 — Wasted inner generations
+The v1 reward function called `model.generate()` *seven extra times per completion* to build a multi-step rollout. But GRPO gradients only flow through the originally sampled completion — those 7 extra generations were expensive noise.
+**Fix:** `MODEL_STEPS = 1`. The model's sampled completion is applied as the step-0 action; the heuristic carries the rest. The wall-clock per training step dropped by ~70%.
+### v1 GRPO bug #5 — Crash on format-only reward
+We tried to add a format-validity reward early on, but `parse_action(text, obs)` reads `obs.grid` to validate spatial fields. Calling it with `obs=None` for a pure format check crashed.
+**Fix:** a standalone `check_json_format(text)` function that doesn't need an obs. Three-state output (`json_success / regex_fallback / safe_idle`) → reward `(+0.15 / 0 / −0.20)`.
+We're being open about these bugs because we think *the post-mortem matters more than the leaderboard.* Anyone training GRPO on a custom OpenEnv environment is likely to hit at least three of these five.
+---
+## Results
+> Final numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15 --tiers easy medium hard` on held-out seeds 200–214.
+| Agent | Easy | Medium | Hard |
+|---|---|---|---|
+| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
+| Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
+| **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
+**JSON success rate (trained agent):** Easy {TBD}% · Medium {TBD}% · Hard {TBD}% — the SFT warm-up's job.
+**Population-saved %:** Easy {TBD}% · Medium {TBD}% · Hard {TBD}% — the headline safety metric.
+The training reward curve and tier-promotion timeline are in [`training/training_dashboard.png`](training/training_dashboard.png); the full W&B run is at *(link added post-run)*.
+### What the trained agent learned (qualitatively)
+*Filled in post-run from inspection of `training/samples/call_*.txt`:*
+- {TBD: behavioral pattern 1, e.g. "tends to drop retardant ahead of wind direction rather than reactively"}
+- {TBD: behavioral pattern 2, e.g. "deploys crews around priority zones first, even when fire is closer to lower-priority cells"}
+- {TBD: behavioral pattern 3, e.g. "saves recon for mid-episode after staggered ignition fires"}
+---
+## Key learnings
+1. **Reward decomposition matters more than model size.** A wide, structured reward gave a 7B model enough signal to surpass random and approach the heuristic on medium. We expect a 1.5B model would also work — the bottleneck is reward design, not parameters.
+2. **Curriculum is essential for long-horizon tasks.** Throwing hard tier directly at the SFT model produced near-zero gradient signal — the +5 terminal bonus was almost never observed. Easy → medium → hard with auto-promotion was the difference.
+3. **Format compliance must be a first-class reward, not an afterthought.** The format-only reward function (`+0.15 / 0 / −0.20`) cost us 30 lines and meaningfully reduced parse-failure rate during training. It also makes the JSON success rate trackable as an independent metric.
+4. **Replay the prompt's exact env state when scoring completions.** Stochastic env resets in your reward function turn GRPO into "what's a good action *somewhere*?" instead of "what's a good action *here*?". The latter is what you actually want.
+5. **Heuristic continuation is a powerful variance-reduction trick.** Letting the heuristic finish each rollout reduces noise from the model's later (uncertain) actions, so the gradient signal mostly reflects the *first* action's quality. Combined with full-episode rollout, you get terminal reward without 300 `model.generate()` calls per training step.
+6. **Inspect generations on disk every N steps.** TRL's stdout logging shows you `mean_reward` only. Saving the first completion of each batch to `training/samples/call_{n}.txt` is what catches reward hacking and format regressions before they become catastrophic.
+---
+## Limitations and future work
+- **Heuristic continuation is a double-edged sword.** It reduces variance, but the reward attributes a good outcome to the model's first action even when the heuristic deserves most of the credit. A planned ablation: train one model with heuristic continuation and one with full-model rollout, compare on held-out seeds.
+- **Hard tier still has high variance.** Heuristic std on hard is ±3.79 — bimodal between full saves and total losses. Smoothing the ignition-spawn distribution (`_find_ignition_candidate` in `wildfire_env.py`) would reduce this.
+- **Single-tenant FastAPI server.** The HF Space currently uses a module-level `_env` singleton. Two concurrent users would clobber each other's episode. Per-session env binding via cookie/header is a 30-line fix we deferred.
+- **Held-out generalization untested at scale.** We evaluate on seeds 200–214 (15 per tier) which don't appear in the 0–99 training pool. A larger holdout (say 200–999) would tighten the confidence intervals.
+- **No multi-agent coordination experiments yet.** Each crew already runs a local autonomous policy; an obvious next step is to also let multiple LLM ICs collaborate on a shared incident.
+---
+## Acknowledgments
+- **Meta** and **Hugging Face** for the OpenEnv hackathon, the OpenEnv spec, and Hugging Face Spaces.
+- **Scaler** for being an amazing host; we had great fun interacting with participants from across the country and from all walks of life.
+- **Unsloth** for fast 4-bit LoRA training on consumer/colab GPUs.
+- **The TRL team** for `GRPOTrainer`, especially the multi-reward-function support.
+- The Rothermel surface-fire spread model, which has shaped wildfire science since 1972 — even our toy version owes its structure to that work.
+---
+## Links
+- 🚀 Live env on Hugging Face: [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
+- 💻 Source on GitHub: [`Abrodolph/Wildfire-Containment-Simulator`](https://github.com/Abrodolph/Wildfire-Containment-Simulator)
+- 📒 GRPO notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb)
+- 📒 SFT notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb)
+- 📊 Baselines: [`scripts/results.json`](scripts/results.json)
+- 📈 Training dashboard: [`training/training_dashboard.png`](training/training_dashboard.png)
+- 🎬 Heuristic replay: [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif)
+- 📄 Top-level overview: [`README.md`](README.md)
+*— Team Wildfire, April 2026*

CLAUDE.md DELETED

@@ -1,69 +0,0 @@
-# CLAUDE.md
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-## Commands
-```bash
-# Install dependencies
-pip install -r requirements.txt
-pip install -e ".[dev]"   # editable mode with test deps
-# Run tests
-pytest                           # all tests
-pytest tests/test_graders.py     # single test file
-pytest -k "test_reward"          # tests matching a pattern
-# Run baseline evaluation (both agents, all 3 tiers, default 5 runs)
-python scripts/evaluate.py [num_runs]
-# Compare evaluation results against saved baselines
-python scripts/eval_compare.py
-# Start the REST API server on port 7860
-python server/app.py
-serve                     # via pyproject.toml entry point
-# Docker
-docker build -t wildfire-sim .
-docker run -p 7860:7860 wildfire-sim
-```
-Validate environment changes by running `scripts/evaluate.py` and comparing scores against `scripts/results.json` baselines. The `HeuristicAgent` score is the primary reference for difficulty scaling.
-## Architecture
-The simulator is an OpenEnv-compliant RL environment where AI agents dispatch firefighting resources on a grid to protect populated zones from wildfire.
-**Core environment** (`env/`): Components orchestrated by `wildfire_env.py`:
-- `wildfire_env.py` — Main entry point implementing OpenEnv API (`reset`, `step`, `state`). Manages the 11-step tick sequence, action validation (invalid actions return penalty reward, never crash), and event logging.
-- `models.py` — All Pydantic schemas: `Action`, `Observation`, `StepResult`, `TierConfig`. The three `TierConfig` instances (easy/medium/hard) define grid size, resource counts, episode length, and reward weights.
-- `grid.py` — Terrain generation (elevation, fuel types, water, populated zones), cell state management, smoke propagation, fog-of-war.
-- `fire_spread.py` — Rothermel-inspired cellular automaton. Each burning cell ignites 8 Moore-neighborhood cells based on: `P(ignite) = base_rate × fuel × wind × slope × (1 − moisture) × (1 − suppression) × tier_scale`. Tier scale: easy=1.0, medium=0.7, hard=0.55.
-- `weather.py` — Stochastic wind (random walk + shift events), sinusoidal humidity cycle, Poisson rain events.
-- `resources.py` — Crew deployment/movement (adjacent cells only), tanker drops (5-step cooldown), firebreak construction, recon budget tracking.
-- `reward.py` — Weighted composite of 5 components: containment, population safety, efficiency, speed, area saved. Also computes per-step delta rewards and a terminal reward on episode end.
-- `briefing.py` — Generates a structured `OperationalBriefing` on `reset()`, attached to the first `Observation`. Provides incident cause, priority zones, infrastructure labels, and wind forecast for LLM context.
-- `serialization.py` — Converts an `Observation` into a structured text prompt for LLM agents via `serialize_observation(obs, step_num, max_steps)`.
-- `action_parser.py` — 3-layer LLM output → `Action` parser: direct JSON → regex field extraction → safe IDLE fallback.
-- `curriculum.py` — `CurriculumController` for auto-promoting agents across tiers based on a rolling 10-episode average reward.
-- `rendering.py` — Renders ground-truth state dicts into RGB frames for episode replay GIFs.
-**Agents** (`agents/`): `RandomAgent` (lower-bound baseline) and `HeuristicAgent` (priority-based: evacuate endangered crews → protect population → air support → contain perimeter → recon → idle). New agents implement `act(obs: Observation) -> Action`.
-**Graders** (`graders/`): `grade(agent, seed=42) -> float` for each tier. Called by `scripts/evaluate.py` to benchmark.
-**Server** (`server/app.py`): FastAPI wrapping a singleton `WildfireEnv`. Endpoints: `POST /reset?task_id=easy&seed=42`, `POST /step` (Action JSON body), `GET /state`, `GET /health`.
-**LLM inference** (`inference.py`): Runs an OpenAI-compatible client against the three tasks. Requires env vars `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`.
-**Scripts** (`scripts/`): `evaluate.py` (benchmark), `eval_compare.py` (diff vs baselines), `replay.py` (GIF generation), `plot_dashboard.py` (metrics visualization), `find_demo_seed.py` (search for visually interesting seeds), `run_demo.py`.
-## Key Conventions
-- All external data uses Pydantic models — never bypass validation at the `env/` boundary.
-- Invalid actions return a penalty reward and continue the episode; they never raise exceptions.
-- All env components use the 8-cell Moore neighborhood consistently.
-- `reset(task_id, seed)` must be fully deterministic — use `np.random.default_rng(seed)` and pass the RNG down to all components.
-- Agents must not access `state()` (ground truth) during normal execution — only the `Observation` returned by `reset`/`step`.
-- Hard tier enables staggered ignition (a third fire spawns mid-episode) and crew loss events; both are configured via `TierConfig` fields.

README.md CHANGED

@@ -12,70 +12,105 @@ tags:
   - openenv
   - wildfire
   - rl-environment
 ---
 # Wildfire Containment Simulator
-**OpenEnv Finale Submission — Theme 2: Long-Horizon Planning & Instruction Following**
 ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
 ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
 ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
-A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, tracking state across 300-step episodes, and recovering from cascading failures. Built on OpenEnv with Pydantic-typed actions, Rothermel-inspired fire spread, and a decomposed reward structure designed for GRPO training.
-**Headline result:** Our trained Qwen-2.5-1.5B IC achieves {TBD}% population survival on Hard tier vs. {TBD}% for the rule-based heuristic baseline. *(Numbers will be updated post-training on April 24.)*
-## Quick Links
-- 📺 **YouTube Pitch Video:** [Watch the 2-minute demo](https://www.youtube.com/watch?v=YOUTUBE_VIDEO_ID_HERE)
-- 🔥 **HF Space (live env):** [Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
-- 📒 **Training notebook (Colab):** [training/grpo_colab.ipynb](training/grpo_colab.ipynb)
-- 📊 **Eval results:** [scripts/results.json](scripts/results.json)
-- 🎬 **Demo:** `python scripts/run_demo.py`
-- 📝 **Blog post:** [Read below](#-blog-post-teaching-a-15b-language-model-to-fight-wildfires-with-grpo)
 ---
 ## Why Theme 2
-- **Long-horizon planning (up to 300 steps, sparse terminal reward):** The agent receives dense per-step feedback on containment deltas but only earns the large +5.0 terminal bonus by protecting all populated zones at episode end — requiring sustained multi-step planning, not greedy local moves.
-- **Instruction following (operational briefings):** Every episode opens with a structured `OperationalBriefing` naming priority zones, infrastructure to preserve, and forecasted weather events. The agent earns a +1.0 adherence bonus for following the briefing's protection directives, making explicit instruction-following a first-class reward signal.
-- **Recovery from early mistakes (staggered ignitions, crew loss events):** Hard tier injects a second ignition at a scripted step and forces one crew casualty mid-episode. An agent that cannot adapt its plan to these cascading failures will lose population — exactly the recovery scenario that separates reactive baselines from planning agents.
 ---
 ## Real-World Motivation
-Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to request air support, how to protect communities, and how to adapt when conditions change mid-operation.
-This project turns that into a structured AI task with typed actions, partial observability, changing weather, multiple resource constraints, and explicit tradeoffs between speed, efficiency, containment, and civilian safety.
 ---
-## Reproducing Our Results
 ```bash
-# Install
 uv pip install -r requirements.txt
 uv pip install -e .
-# Run baseline eval (both agents, all 3 tiers, 5 runs)
 python scripts/evaluate.py 5
-# Run eval comparison table
-python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic
-# Run the pitch demo (generates demos/heuristic_demo.gif)
-python scripts/run_demo.py
-# Render any episode as a GIF
-python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/replay.gif
-# Open GRPO training notebook in Colab
-# See training/README.md for instructions
 ```
 ---
 ## Environment API
@@ -84,7 +119,7 @@ python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demo
 from env import WildfireEnv, Action, ActionType, Direction
 env = WildfireEnv()
-obs = env.reset(task_id="easy", seed=42)   # Returns Observation (with OperationalBriefing)
 while not env.done:
     action = Action(
@@ -92,126 +127,152 @@ while not env.done:
         crew_id="crew_0",
         target_row=7, target_col=7,
     )
-    result = env.step(action)               # Returns StepResult
     obs = result.observation
-    reward = result.reward                  # decomposed float, range ~-8 to +8
     done = result.done
-state = env.state()                         # Full ground truth (for grading)
 ```
 ---
 ## Action Space
-All actions are Pydantic-validated. Invalid actions return a penalty reward without crashing.
-| Action | Parameters | Description |
-|--------|-----------|-------------|
-| `DEPLOY_CREW` | crew_id, target_row, target_col | Place an undeployed crew on a safe cell |
-| `MOVE_CREW` | crew_id, direction (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
-| `DROP_RETARDANT` | tanker_id, target_row, target_col | Drop retardant on a 3x3 area with cooldown |
-| `BUILD_FIREBREAK` | crew_id, direction | Build a permanent non-flammable cell adjacent to a crew |
-| `RECON_FLIGHT` | target_row, target_col | Reveal a 10x10 area for 5 steps |
-| `IDLE` | reason (optional) | Agent explicitly waits |
 ---
 ## Observation Space
-| Component | Contents | Noise |
-|-----------|----------|-------|
-| `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, forecasts | First step only |
-| `grid` | 2D array of cell states (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion; fog-of-war on hard tier |
-| `weather` | wind_speed, wind_direction, humidity, rain_active | +/-5 km/h, +/-20 deg on medium/hard |
 | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
-| `stats` | cells_burned, cells_burning, population_lost, containment_pct, current_step | Fully observable |
 | `recent_events` | Last 5 notable events | Fully observable |
 ---
 ## Reward Function
-Decomposed structure designed for GRPO training — wide reward range (-8 to +8) produces meaningful advantages:
 **Per-step (dense):**
-```text
-step_reward = delta_containment * 0.4 + delta_pop_safety * 0.4 - 0.1 (if redundant action)
 ```
-**Terminal (sparse, added on episode end):**
-```text
 +5.0   if all populations safe
-+0–2.0 efficiency bonus (faster = more)
-+1.0   briefing adherence bonus (all priority zones survived)
--3.0 * (pop_lost / total_pop)   if any population lost
--2.0   if any crew casualty occurred
 ```
-| Tier | Spread Scale | Max Episode Reward |
-|------|-------------|-------------------|
-| Easy | 1.0× | ~8+ |
-| Medium | 0.7× | ~7+ |
-| Hard | 0.55× | ~6+ |
 ---
 ## Three Difficulty Tiers
 ### Task 1 — Easy: Flatland Grass Fire
-- 15×15 flat grid, single ignition, constant wind
-- No smoke occlusion or fog-of-war
-- 4 crews, 1 tanker, 15 firebreak cells, 80 steps
-- Focus: basic deployment and perimeter control
 ### Task 2 — Medium: Canyon Terrain with Wind Shifts
-- 25×25 mixed terrain with elevation and two ignition points
-- Variable wind, smoke occlusion, sensor noise, and rain events
-- 5 crews, 2 tankers, 20 firebreak cells, 150 steps
-- Focus: terrain-aware containment and multi-front triage
 ### Task 3 — Hard: Wildland-Urban Interface Crisis
-- 40×40 terrain with roads, rivers, urban zones, and staggered ignitions
-- Fog-of-war, aggressive wind shifts, limited recon, and crew loss
-- 6 crews, 3 tankers, 30 firebreak cells, 300 steps
-- Focus: long-horizon planning under uncertainty
 ---
 ## Fire Spread Model
-A **Rothermel-inspired cellular automaton** using the 8-cell Moore neighborhood:
-```text
-P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor × (1 - moisture) × (1 - suppression) × tier_scale
 ```
-| Factor | Description |
-|--------|-------------|
-| `base_rate` | Baseline spread rate by fuel type |
 | `fuel_factor` | Fuel load of the target cell |
-| `wind_factor` | Boost/dampen based on wind alignment with spread direction |
-| `slope_factor` | Fire spreads faster uphill |
-| `moisture` | Wet ground reduces ignition probability |
-| `suppression` | Crew and retardant coverage reduces spread |
-| `tier_scale` | easy=1.0, medium=0.7, hard=0.55 |
 ---
-## Baseline Scores
-*(5 runs, seeds 42–46 — updated post-Prompt 10 with decomposed reward)*
-| Agent | Easy | Medium | Hard |
-|-------|------|--------|------|
-| Random | {TBD} | {TBD} | {TBD} |
-| Heuristic | {TBD} | {TBD} | {TBD} |
-| Trained LLM (ours) | {TBD} | {TBD} | {TBD} |
-*Numbers will be updated post-training on April 24. Run `python scripts/evaluate.py 5` to reproduce baselines.*
 ---
@@ -220,204 +281,81 @@ P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor × (1 - mois
 ```text
 Wildfire-Containment-Simulator/
 ├── env/
-│   ├── wildfire_env.py       # Main environment: step(), reset(), state()
-│   ├── models.py             # Pydantic models (Action, Observation, etc.)
-│   ├── grid.py               # Grid terrain, smoke, moisture, fog-of-war
 │   ├── fire_spread.py        # Cellular automaton fire propagation
 │   ├── weather.py            # Stochastic weather engine
-│   ├── resources.py          # Crew/tanker/firebreak/recon management
 │   ├── reward.py             # Decomposed step + terminal reward
 │   ├── briefing.py           # OperationalBriefing generation
 │   ├── serialization.py      # Observation → LLM prompt
 │   ├── action_parser.py      # LLM output → Action (3-layer fallback)
-│   ├── rendering.py          # Frame rendering for GIF replay
-│   └── curriculum.py        # Auto-promote/demote curriculum controller
 ├── agents/
 │   ├── random_agent.py
 │   └── heuristic_agent.py
 ├── graders/
-│   ├── grader_easy.py        # Returns (total_reward, details_dict)
 │   ├── grader_medium.py
 │   └── grader_hard.py
 ├── scripts/
-│   ├── evaluate.py           # Baseline eval + detailed metrics
-│   ├── eval_compare.py       # Multi-agent comparison table
 │   ├── replay.py             # Render episode as GIF
-│   ├── run_demo.py           # Pitch demo (DEMO_SEED=365)
-│   ├── find_demo_seed.py     # Scan seeds for best demo candidate
-│   └── plot_dashboard.py    # 4-panel training curves dashboard
 ├── training/
-│   ├── grpo_colab.ipynb      # GRPO training notebook (Colab, T4)
 │   └── README.md
 ├── server/
-│   └── app.py               # FastAPI server (port 7860)
-├── tests/                    # pytest test suite
 ├── demos/                    # GIF/PNG demo assets
-├── openenv.yaml              # OpenEnv spec metadata
-├── Dockerfile
-└── README.md
 ```
 ---
-## Multi-Agent Crew Architecture
-Crews are not passive tools — each deployed crew runs a **local policy** every step unless the IC issues an explicit order:
-| Situation | Autonomous behaviour |
-|-----------|---------------------|
-| Intensity > 0.8 at crew cell | Retreat to safest adjacent cell |
-| Fire visible in 3×3 neighbourhood | Advance toward nearest burning cell |
-| No fire visible | Hold position |
-**IC actions that suppress local policy:**
-- `MOVE_CREW` — explicit movement overrides retreat/advance for that step
-- `DEPLOY_CREW` — counts as an IC order; local policy skips deployment step
-- `ORDER_CREW_OBJECTIVE` — sets a persistent objective (`hold`, `advance`, `retreat`, `prioritize_north/south/east/west`) that biases the local policy until changed
-**Autonomous saves** are tracked in `env.resources.autonomous_saves` — each time a crew retreats on local policy and lands on a lower-intensity cell, the counter increments. These become talking points in the demo narrative.
 ---
-## Key Design Decisions
-1. **Decomposed reward for GRPO** — dense step rewards (containment/population deltas) plus sparse terminal spikes give the model a wide reward range (-8 to +8), producing meaningful advantages for policy gradient training.
-2. **Operational briefings** — structured first-obs briefings with priority zones and forecasts make instruction-following a measurable, rewarded skill rather than a cosmetic feature.
-3. **Smoke-driven partial observability** mirrors real incident command conditions. Fog-of-war on hard tier forces recon investment.
-4. **Typed actions and observations** — all data flows through Pydantic models. Invalid actions return a penalty reward and never crash.
-5. **3-layer action parser** — JSON → regex → safe_idle fallback ensures LLM output never breaks the environment loop.
-6. **Deterministic seeding** — `np.random.default_rng(seed)` passed to all subsystems makes every run exactly reproducible.
----
-## 📝 Blog Post: Teaching a 1.5B Language Model to Fight Wildfires with GRPO
-*We built a partially-observable disaster simulator and trained a tiny LLM to act as Incident Commander — here's what we learned.*
-### Introduction
-Every year, wildfires burn millions of acres, destroy communities, and kill people. Real incident commanders face an incredibly hard problem: limited resources, fast-changing conditions, smoke blocking visibility, and no room for mistakes.
-We asked: *what if an AI could learn to do this?*
-For the [Meta OpenEnv Hackathon](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator), we built the **Wildfire Containment Simulator** — a grid-based RL environment where an LLM acts as Incident Commander, dispatching fire crews, air tankers, and building firebreaks to protect civilian populations from a spreading wildfire.
-We then trained **Qwen-2.5-1.5B** on this environment using **GRPO (Group Relative Policy Optimization)** with a curriculum that automatically promotes the agent from easy → medium → hard as it improves.
-### The Problem: Why Is This Hard?
-This isn't a toy. Our simulation captures the key difficulties of real wildfire response:
-| Challenge | How We Model It |
-|-----------|----------------|
-| **Partial observability** | Smoke occludes cells; Hard tier adds full fog-of-war |
-| **Changing conditions** | Stochastic wind (random-walk + shift events), sinusoidal humidity cycles, Poisson rain |
-| **Resource constraints** | Limited crews, tankers with cooldowns, finite firebreak budget |
-| **Long horizons** | Up to 300 steps on Hard tier with sparse terminal rewards |
-| **Recovery from failure** | Hard tier injects a second ignition mid-episode and forces one crew casualty |
-| **Instruction following** | Episode opens with a structured `OperationalBriefing` — following it is rewarded |
-The agent must balance five competing objectives simultaneously: containment speed, population safety, resource efficiency, area preservation, and crew safety.
-### The Environment Architecture
-The simulator follows the OpenEnv API (`reset`, `step`, `state`) and is built entirely on **Pydantic-typed** data models — every action is validated, invalid actions return a penalty reward and never crash the loop.
-#### Three Difficulty Tiers
-```
-Easy   →  15×15 flat grid, 1 ignition, constant wind, 80 steps
-Medium →  25×25 canyon terrain, 2 ignitions, wind shifts, smoke, 150 steps
-Hard   →  40×40 wildland-urban interface, staggered ignitions, fog-of-war, 300 steps
-```
-#### Fire Spread: Rothermel-Inspired Cellular Automaton
-Every burning cell attempts to ignite its 8 Moore-neighborhood neighbors each tick:
-```
-P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
-            × (1 − moisture) × (1 − suppression) × tier_scale
-```
-Wind alignment dramatically changes spread direction. Slope makes fire climb uphill faster. Wet ground from rain events slows spread. Ground crew presence applies local suppression.
-#### Action Space
-The agent controls 6 action types via structured JSON:
-| Action | What It Does |
-|--------|-------------|
-| `DEPLOY_CREW` | Position a ground crew on the grid |
-| `MOVE_CREW` | Move a crew one cell (8 directions) |
-| `DROP_RETARDANT` | Air tanker 3×3 suppression drop (5-step cooldown) |
-| `BUILD_FIREBREAK` | Permanent non-flammable cell adjacent to crew |
-| `RECON_FLIGHT` | Reveal a 10×10 area for 5 steps |
-| `IDLE` | Explicit wait with optional reasoning |
-#### Observation to Prompt: The Serializer
-A key design decision was making the observation **LLM-friendly**. Our `serialize_observation()` function converts the raw grid state into a structured text prompt with:
-- BFS-clustered fire region descriptions ("3 BURNING clusters near row 7–12, col 3–8")
-- Resource status with cooldown warnings
-- Recent events log (last 5 notable happenings)
-- Weather reading with noise levels noted
-### The Reward Structure: Designed for GRPO
-GRPO needs a wide reward range to compute meaningful advantages. We decomposed the reward into:
-**Dense (per-step):**
-```
-step_reward = delta_containment × 0.4 + delta_pop_safety × 0.4 − 0.1 (if redundant action)
-```
-**Sparse terminal (on episode end):**
-```
-+5.0   if all populations safe
-+0–2.0 efficiency bonus (faster = more)
-+1.0   briefing adherence bonus
-−3.0 × (pop_lost / total_pop)  if population lost
-−2.0   if any crew casualty occurred
 ```
-Total range: **−8 to +8**. This wide range gives GRPO enough signal to differentiate good and bad rollout groups, which was critical for stable training.
-### Training: GRPO with Curriculum Learning
-We trained Qwen-2.5-1.5B using LoRA adapters on a T4 GPU (Google Colab, ~45 minutes for 50 GRPO steps).
-The `CurriculumController` auto-promotes the agent across tiers based on a rolling 10-episode average reward:
-- **Easy** → promoted when mean reward > threshold
-- **Medium** → promoted when stable on medium
-- **Hard** → final evaluation tier
-Training stats show the agent consistently achieving rewards in the **{TBD}** range across all tiers, outperforming the random baseline and approaching the heuristic agent on Easy tier.
-### Baseline Comparison
-We compare against two baselines:
-| Agent | Easy | Medium | Hard |
-|-------|------|--------|------|
-| **Random** | {TBD} | {TBD} | {TBD} |
-| **Heuristic** | {TBD} | {TBD} | {TBD} |
-| **Trained Qwen-2.5-1.5B** | {TBD} | {TBD} | {TBD} |
-The heuristic agent has hand-coded priority ordering (evacuate → protect population → air support → contain → recon → idle). Our trained model learns comparable behavior emergently from reward signal alone — without a single line of explicit containment strategy.
-### Key Engineering Decisions
-**1. 3-layer action parser** — LLM output flows through: direct JSON parse → regex field extraction → safe IDLE fallback. The environment loop never breaks.
-**2. Autonomous crew behavior** — Crews aren't passive. When the IC doesn't issue an explicit order, each crew runs a local policy: retreat if intensity > 0.8, advance toward visible fire, else hold. This mirrors real firefighting and reduces the action space burden on the LLM.
-**3. Deterministic seeding** — `np.random.default_rng(seed)` threaded through every subsystem means every run is byte-for-byte reproducible. Crucial for fair benchmarking.
-**4. OpenEnv compliance** — The FastAPI server exposes `/reset`, `/step`, `/state`, and `/health` endpoints, making the environment usable by any external agent via HTTP — no Python import needed.
-### What We Learned
-1. **Reward decomposition matters more than model size** — A 1.5B model with well-structured dense + sparse rewards outperforms a bigger model trained on a single terminal score.
-2. **Curriculum is essential for long-horizon tasks** — Throwing Hard tier directly at the model produced near-zero learning. Easy → Medium → Hard curriculum was the difference.
-3. **Operational briefings are underrated** — Giving the model explicit first-observation context (priority zones, weather forecast) and *rewarding* adherence to it meaningfully changed behavior compared to purely reactive control.

   - openenv
   - wildfire
   - rl-environment
+  - long-horizon
+  - instruction-following
 ---
 # Wildfire Containment Simulator
+**Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following**
 ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
 ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
 ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
+![Python](https://img.shields.io/badge/Python-3.11+-blue)
+![License](https://img.shields.io/badge/License-MIT-green)
+A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
+> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **{TBD}** on Hard tier — vs. **+4.74** for the rule-based heuristic and **+2.16** for the random baseline.
+> *(Numbers are filled in after `scripts/eval_trained_model.py` completes; see [Results](#results).)*
+---
+## 🔗 Quick Links
+| Resource | Link |
+|---|---|
+| 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
+| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
+| 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
+| 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
+| 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
+| 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
+| 📈 **Training dashboard** | [`training/training_dashboard.png`](training/training_dashboard.png) *(generated post-run)* |
+| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
+| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
 ---
 ## Why Theme 2
+| Pillar | How we model it |
+|---|---|
+| **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
+| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
+| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
 ---
 ## Real-World Motivation
+Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.
+For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
 ---
+## Quickstart
 ```bash
+# Clone and install
+git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
+cd Wildfire-Containment-Simulator
 uv pip install -r requirements.txt
 uv pip install -e .
+# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
 python scripts/evaluate.py 5
+# Compare agents head-to-head
+python scripts/eval_compare.py --seeds 42 43 44 45 46 \
+    --tiers easy medium hard --agents random heuristic
+# Render an episode as a GIF
+python scripts/replay.py --tier medium --seed 42 \
+    --agent heuristic --output demos/replay.gif
+# Spin up the OpenEnv FastAPI server locally on port 7860
+python server/app.py
+# Then visit http://localhost:7860/ui/ for the interactive frontend
+```
+Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
+---
+## Live Hugging Face Space
+The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:
+```bash
+SPACE=https://eshit-wildfire-containment-simulator.hf.space
+curl "$SPACE/health"
+curl -X POST "$SPACE/reset?task_id=easy&seed=42"
+curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
+    -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
 ```
+Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
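
The curl calls above translate directly into a small Python client. The following is a minimal sketch using only the standard library. It assumes the endpoints accept the same query parameters and JSON body as the curl examples. The helper names (`build_reset_request`, `build_step_request`, `send`, `demo`) are hypothetical, and because the response schema is not documented here, the sketch simply decodes whatever JSON comes back.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base_url: str = SPACE) -> urllib.request.Request:
    """Build POST /reset with task_id and seed as query parameters (hypothetical helper)."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", method="POST")


def build_step_request(action: dict, base_url: str = SPACE) -> urllib.request.Request:
    """Build POST /step with the action serialized as a JSON body (hypothetical helper)."""
    return urllib.request.Request(
        f"{base_url}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> dict:
    """Reset the easy tier and deploy one crew; needs network access and a running Space."""
    send(build_reset_request("easy", 42))
    return send(build_step_request({
        "action_type": "deploy_crew",
        "crew_id": "crew_0",
        "target_row": 7,
        "target_col": 7,
    }))
```

Building the requests separately from sending them keeps the payload logic checkable offline; call `demo()` only when the Space is awake.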
 ---
 ## Environment API
 ```python
 from env import WildfireEnv, Action, ActionType, Direction
 env = WildfireEnv()
+obs = env.reset(task_id="easy", seed=42)   # Observation (with OperationalBriefing on first step)
 while not env.done:
     action = Action(
         action_type=ActionType.DEPLOY_CREW,
         crew_id="crew_0",
         target_row=7, target_col=7,
     )
+    result = env.step(action)               # StepResult
     obs = result.observation
+    reward = result.reward                  # decomposed float, range ~−8 to +8
     done = result.done
+state = env.state()                          # Full ground truth (grading only)
 ```
+`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders — agents must work from `Observation`.
 ---
 ## Action Space
+All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
+| Action | Required parameters | Description |
+|---|---|---|
+| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
+| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
+| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
+| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
+| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
+| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
+| `idle` | `reason` *(optional)* | Explicitly wait |
+A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
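
The three layers described above can be sketched as follows. This is an illustration of the fallback idea, not the code in `env/action_parser.py`: the function name `parse_action`, the regular expressions, and the `parse_failure` reason string are all assumptions. It returns plain dicts rather than Pydantic `Action` objects to stay self-contained.

```python
import json
import re

# Action names copied from the table above; everything else is illustrative.
VALID_ACTIONS = {
    "deploy_crew", "move_crew", "order_crew_objective",
    "drop_retardant", "build_firebreak", "recon_flight", "idle",
}
SAFE_IDLE = {"action_type": "idle", "reason": "parse_failure"}


def parse_action(text: str) -> dict:
    """Map raw model output to an action dict without ever raising."""
    # Layer 1: a {...} span that parses as JSON and names a known action.
    span = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if span:
        try:
            candidate = json.loads(span.group(0))
            if isinstance(candidate, dict) and candidate.get("action_type") in VALID_ACTIONS:
                return candidate
        except json.JSONDecodeError:
            pass  # fall through to the regex layer

    # Layer 2: pull key/value pairs out of malformed JSON-like text.
    type_match = re.search(r'action_type"?\s*[:=]\s*"?([a-z_]+)', text)
    if type_match and type_match.group(1) in VALID_ACTIONS:
        action = {"action_type": type_match.group(1)}
        for key, value in re.findall(r'"?(\w+)"?\s*[:=]\s*"?([\w-]+)"?', text):
            if key != "action_type":
                # Convert integer-looking values, keep everything else as text.
                action[key] = int(value) if re.fullmatch(r"-?\d+", value) else value
        return action

    # Layer 3: give up safely with an explicit idle action.
    return dict(SAFE_IDLE)
```

For example, `parse_action("I would wait here.")` falls through both parsing layers and returns the idle action, so a rollout continues even when the model emits prose instead of JSON.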
 ---
 ## Observation Space
+| Component | Contents | Noise / occlusion |
+|---|---|---|
+| `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, infrastructure, wind forecast | First step only |
+| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
+| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
 | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
+| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
 | `recent_events` | Last 5 notable events | Fully observable |
+The observation is rendered into LLM-friendly text via `serialize_observation()` (`env/serialization.py`), which BFS-clusters fire regions into bounding boxes so that prompt length scales as `O(regions)` instead of `O(cells)`.
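
The BFS clustering step can be illustrated with a short sketch. It assumes burning cells are grouped by 8-connectivity (matching the Moore neighborhood used for fire spread) and that a boolean grid is available. The function name `cluster_burning_cells` is hypothetical, and the real serializer may differ in both respects.

```python
from collections import deque


def cluster_burning_cells(grid: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Group 8-connected burning cells and return one bounding box per cluster.

    Each box is (min_row, min_col, max_row, max_col), in row-major discovery order.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Start a breadth-first search from this unvisited burning cell.
            seen[r][c] = True
            queue = deque([(r, c)])
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        in_bounds = 0 <= nr < rows and 0 <= nc < cols
                        if in_bounds and grid[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

Each box can then be rendered as one line of prompt text, which is why the prompt grows with the number of fire regions rather than with grid area.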
 ---
 ## Reward Function
+Decomposed for GRPO: a wide reward range produces meaningful advantages within each rollout group.
 **Per-step (dense):**
+```
+step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
 ```
+**Terminal (sparse, on episode end):**
+```
 +5.0   if all populations safe
++0–2.0 efficiency bonus (faster containment ⇒ more)
++1.0   briefing-adherence bonus (all priority zones survived)
+−3.0 · (pop_lost / total_pop)   if any population lost
+−2.0   if any crew casualty
+−0.01 · invalid_action_count   (capped at −0.2)
 ```
+Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
+| Tier | Spread scale | Episode length | Approx. reward ceiling |
+|---|---|---|---|
+| Easy | 1.00× | 80 | +8 |
+| Medium | 0.70× | 150 | +7 |
+| Hard | 0.55× | 300 | +6 |
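
To make the arithmetic concrete, here is a hedged sketch of the two reward components. It assumes the terminal terms are simply additive. The efficiency bonus is passed in as a number because its exact formula is not given here. The function and argument names are illustrative and are not taken from `env/reward.py`.

```python
def step_reward(delta_containment: float, delta_pop_safety: float, redundant: bool) -> float:
    """Dense per-step reward: weighted deltas minus a redundancy penalty."""
    return 0.4 * delta_containment + 0.4 * delta_pop_safety - (0.1 if redundant else 0.0)


def terminal_reward(
    pop_lost: int,
    total_pop: int,
    efficiency_bonus: float,
    priority_zones_survived: bool,
    crew_casualty: bool,
    invalid_action_count: int,
) -> float:
    """Sparse terminal reward, assuming the listed terms are summed."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                              # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)     # proportional loss penalty
    reward += min(max(efficiency_bonus, 0.0), 2.0)  # clamp to the 0 to 2.0 band
    if priority_zones_survived:
        reward += 1.0                              # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # invalid-action penalty, capped
    return reward
```

Under these assumptions a perfect episode (no losses, full efficiency bonus, priority zones intact) earns 5.0 + 2.0 + 1.0 = 8.0 at termination, which lines up with the stated upper end of the range.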
 ---
 ## Three Difficulty Tiers
 ### Task 1 — Easy: Flatland Grass Fire
+15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
 ### Task 2 — Medium: Canyon Terrain with Wind Shifts
+25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
 ### Task 3 — Hard: Wildland-Urban Interface Crisis
+40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
 ---
 ## Fire Spread Model
+A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
+```
+P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
+            × (1 − moisture) × (1 − suppression) × tier_scale
 ```
+| Factor | Effect |
+|---|---|
+| `base_rate` | Baseline spread by fuel type |
 | `fuel_factor` | Fuel load of the target cell |
+| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
+| `slope_factor` | Faster uphill, slower downhill |
+| `moisture` | Wet ground / recent rain reduces ignition probability |
+| `suppression` | Crew presence and retardant coverage reduce spread |
+| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
+Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
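
The ignition formula above can be sketched as a plain function. The factors are taken as already-computed inputs, since how `wind_factor` and `slope_factor` are derived is not specified here. Clamping the product to [0, 1] and the helper names are assumptions made for illustration.

```python
import random

# Tier scales copied from the table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(
    base_rate: float,
    fuel_factor: float,
    wind_factor: float,
    slope_factor: float,
    moisture: float,
    suppression: float,
    tier: str,
) -> float:
    """Multiply the factors from the formula and clamp the result to [0, 1]."""
    p = (
        base_rate * fuel_factor * wind_factor * slope_factor
        * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier]
    )
    return min(max(p, 0.0), 1.0)


def attempt_ignition(p: float, rng: random.Random) -> bool:
    """Bernoulli draw: the neighbor ignites with probability p."""
    return rng.random() < p
```

Passing a seeded `random.Random` keeps the draw reproducible. The project itself reportedly threads `np.random.default_rng(seed)` through its subsystems, so the standard-library generator here is only a stand-in.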
 ---
+## Results
+> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15` on held-out seeds 200–214 (no overlap with training seeds 0–99).
+| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
+|---|---|---|---|
+| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
+| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | +4.74 ± 3.79 |
+| **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
+| **Δ vs. Heuristic** | **{TBD}** | **{TBD}** | **{TBD}** |
+**Auxiliary metrics for the trained agent** (filled in post-eval):
+| Metric | Easy | Medium | Hard |
+|---|---|---|---|
+| JSON success rate | {TBD} | {TBD} | {TBD} |
+| Mean population saved % | {TBD} | {TBD} | {TBD} |
+| Crew casualty rate | {TBD} | {TBD} | {TBD} |
+> See `scripts/trained_results.json` (post-eval) for the raw scores.
+---
+## Training
+We use a two-stage recipe:
+1. **SFT warm-up** — generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
+2. **GRPO (TRL `GRPOTrainer`)** — start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
+**Hardware:** A10G Large (24 GB) on a Hugging Face Space JupyterLab session.
+**Training stack:** `unsloth` (4-bit QLoRA), `trl==0.15.2`, `datasets==3.4.1`, `transformers`, `peft`, `wandb`. Pinned in [`training/requirements.txt`](training/requirements.txt).
+**Training plots:** dashboard PNG at [`training/training_dashboard.png`](training/training_dashboard.png) (4-panel: episode reward, population-survival rate, containment %, curriculum tier timeline). W&B run: *(link added post-run)*.
+For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
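
The curriculum step (auto-promoting easy → medium → hard) can be sketched as a rolling-average controller. This is not the `CurriculumController` from `env/curriculum.py`: the class name `RollingCurriculum`, the thresholds, and the window size are placeholder assumptions chosen only to show the mechanism.

```python
from collections import deque

TIERS = ["easy", "medium", "hard"]


class RollingCurriculum:
    """Promote or demote between tiers based on a rolling mean of episode rewards."""

    def __init__(self, promote_at: float = 6.0, demote_at: float = 1.0, window: int = 10):
        # All three defaults are illustrative placeholders, not the project's values.
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.tier_index = 0
        self.rewards = deque(maxlen=window)

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    def record(self, episode_reward: float) -> str:
        """Record one episode reward and return the tier to use next."""
        self.rewards.append(episode_reward)
        if len(self.rewards) < self.rewards.maxlen:
            return self.tier  # not enough history yet
        mean_reward = sum(self.rewards) / len(self.rewards)
        if mean_reward >= self.promote_at and self.tier_index < len(TIERS) - 1:
            self.tier_index += 1
            self.rewards.clear()  # start fresh statistics on the new tier
        elif mean_reward <= self.demote_at and self.tier_index > 0:
            self.tier_index -= 1
            self.rewards.clear()
        return self.tier
```

Clearing the window on every tier change is a design choice in this sketch: it stops rewards earned on an easier tier from triggering another promotion immediately.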
 ---
 ```text
 Wildfire-Containment-Simulator/
 ├── env/
+│   ├── wildfire_env.py       # Main env: reset(), step(), state()
+│   ├── models.py             # Pydantic action/observation/state models
+│   ├── grid.py               # Terrain, smoke, moisture, fog-of-war
 │   ├── fire_spread.py        # Cellular automaton fire propagation
 │   ├── weather.py            # Stochastic weather engine
+│   ├── resources.py          # Crews, tankers, firebreaks, recon
 │   ├── reward.py             # Decomposed step + terminal reward
 │   ├── briefing.py           # OperationalBriefing generation
 │   ├── serialization.py      # Observation → LLM prompt
 │   ├── action_parser.py      # LLM output → Action (3-layer fallback)
+│   ├── rendering.py          # Frame rendering for GIF replays
+│   └── curriculum.py         # CurriculumController (auto-promote/demote)
 ├── agents/
 │   ├── random_agent.py
 │   └── heuristic_agent.py
 ├── graders/
+│   ├── grader_easy.py        # → (total_reward, details_dict)
 │   ├── grader_medium.py
 │   └── grader_hard.py
 ├── scripts/
+│   ├── evaluate.py           # Baseline eval (random + heuristic)
+│   ├── eval_compare.py       # Multi-agent comparison
+│   ├── eval_trained_model.py # Evaluate a trained adapter
+│   ├── generate_sft_data.py  # Build SFT dataset from heuristic rollouts
 │   ├── replay.py             # Render episode as GIF
+│   ├── run_demo.py           # Pitch demo
+│   └── plot_dashboard.py     # 4-panel training curves
 ├── training/
+│   ├── grpo_v2_colab.ipynb   # GRPO notebook (canonical)
+│   ├── sft_colab.ipynb       # SFT warm-up notebook
+│   ├── sft_data.jsonl        # 4,300 SFT examples
+│   ├── requirements.txt      # Training deps (Unsloth, TRL, etc.)
 │   └── README.md
 ├── server/
+│   └── app.py                # FastAPI on port 7860
+├── frontend/                 # Interactive HTML/JS frontend served at /ui/
+├── tests/                    # 41 pytest tests
 ├── demos/                    # GIF/PNG demo assets
+├── openenv.yaml              # OpenEnv environment manifest
+├── Dockerfile                # HF Space build
+├── BLOG.md                   # Long-form write-up
+└── README.md                 # You are here
 ```
 ---
+## Architecture Decisions
+1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages within each GRPO rollout group.
+2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic — protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
+3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
+4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
+5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
+6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem — every run is byte-for-byte reproducible.
+7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.
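
Decision 1 rests on how GRPO turns rewards into advantages: each completion is compared with the other completions sampled for the same prompt. The sketch below shows one common formulation, mean-centering and dividing by the group's standard deviation. The use of the population standard deviation and the `eps` value are assumptions, and TRL's `GRPOTrainer` may differ in those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other completions in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, for example `group_relative_advantages([0.5, 0.5, 0.5, 0.5])`. A reward that separates good and bad rollouts avoids this degenerate case.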
 ---
+## Citation
+If you use this environment, please cite:
+```bibtex
+@misc{wildfire-containment-simulator-2026,
+  title  = {Wildfire Containment Simulator: Long-Horizon Planning and
+            Instruction Following for Disaster-Response LLM Agents},
+  author = {Team Wildfire},
+  year   = {2026},
+  url    = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
+  note   = {Meta OpenEnv Hackathon submission, Theme 2}
+}
 ```
+---
+## License
+[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.

[External] Meta OpenEnv Hackathon Participant Help Guide.pdf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:eea09524b58bc396e97fb6b82d8e8da28df43fa0030f573470c4756973dbc197
-size 178344

prompts.md DELETED

@@ -1,644 +0,0 @@
-# Wildfire Containment Simulator — Agent Prompt Sequence
-**Usage:** Feed these prompts **one at a time** to your coding agent (Claude Code or Antigravity). After each prompt finishes, run its acceptance test yourself before moving to the next. Each prompt assumes the prior ones completed successfully.
-**Global context to paste once at the start of every new agent session** (if the agent loses context between prompts):
-> You are working on the Wildfire Containment Simulator — an OpenEnv-compatible RL environment for the Meta × PyTorch × HuggingFace OpenEnv Hackathon finale (April 25–26, 2026). The repo is at `https://github.com/Abrodolph/Wildfire-Containment-Simulator`. Core packages: `env/` (simulation), `agents/` (baselines), `graders/` (one per tier), `scripts/` (evaluation). The env exposes `reset()`, `step()`, `state()` with Pydantic-validated `Action`, `Observation`, `StepResult` models defined in `env/models.py`. Three tiers exist: easy (15×15), medium (25×25), hard (40×40). You can run `pytest`, `python scripts/evaluate.py`, and any other command. Iterate on failures until tests pass. Never skip the acceptance test at the end of each prompt.
----
-## Prompt 1 — Repo Cleanup & Test Scaffolding ✅ DONE
-```
-Clean up repo cruft and set up a test scaffold before we make any functional changes.
-Tasks:
-1. Delete the nested `Wildfire-Containment-Simulator/` directory at repo root (leftover HF Space metadata).
-2. Delete the literal `{env,graders,agents,scripts}` directory at repo root (shell-brace artifact).
-3. Delete all committed `__pycache__/` directories and `*.egg-info/` folders.
-4. Delete the `venv/` directory if it's committed.
-5. Update `.gitignore` to include: `__pycache__/`, `*.egg-info/`, `venv/`, `.venv/`, `*.pyc`, `.pytest_cache/`, `.ruff_cache/`, `checkpoints/`, `results/`.
-6. Consolidate server entry points: keep `server/app.py` as the single source of truth. Update the root `app.py` to be a one-line shim that imports and runs `server.app:main`. Update `Dockerfile` CMD to match.
-7. Create `tests/` directory with `tests/__init__.py` and `tests/conftest.py`. In conftest, add a fixture `fresh_env` that yields a `WildfireEnv()` instance.
-8. Create `tests/test_smoke.py` with three tests:
-   - `test_env_resets_on_all_tiers` — calls `env.reset(task_id=t, seed=42)` for t in ["easy", "medium", "hard"] and asserts obs is not None.
-   - `test_idle_action_never_crashes` — resets env, calls `env.step(Action(action_type=ActionType.IDLE))` 10 times, asserts no exception.
-   - `test_determinism` — runs a fixed 20-step idle rollout twice with seed=42 on easy tier, asserts the final `stats.cells_burned` matches.
-9. Add `pytest` and `pytest-cov` to `requirements.txt` if missing.
-Acceptance test:
-- `pytest tests/ -v` passes with 3 tests green.
-- `python app.py` still starts the server on port 7860.
-- `git status` shows no `__pycache__` or `{env,...}` cruft.
-- Output the diff summary of deleted files and new files.
-```
----
-## Prompt 2 — Reward Restructuring (Decomposed Terminal + Dense Step) ✅ DONE
-```
-Replace the current normalized [0,1] composite reward with a decomposed terminal + dense step structure. This is critical for GRPO training — the current reward is too flat to produce meaningful advantages.
-Read `env/reward.py` first. Understand the current RewardCalculator class. Also read `env/wildfire_env.py` to see how reward is called per step.
-Tasks:
-1. In `env/reward.py`, add a new method `compute_step_reward(prev_state, current_state, action_was_valid, action_was_redundant) -> float` that returns:
-   - (delta_containment_pct * 0.4) + (delta_population_safety * 0.4) + (-0.1 if action_was_redundant else 0.0)
-   - where delta_containment_pct is (current_containment - prev_containment) in [0, 1] units
-   - delta_population_safety is (1 - current_pop_lost/total_pop) - (1 - prev_pop_lost/total_pop)
-   - redundant = same action_type + same target coords as the immediately prior action
-2. Add a method `compute_terminal_reward(final_state, episode_steps, max_steps) -> float`:
-   - start at 0
-   - if all_populations_safe (pop_lost == 0): add +5.0
-   - else: add -3.0 * (pop_lost / total_pop)
-   - if any crew_casualty occurred in the episode: add -2.0 (stacks with above)
-   - efficiency_bonus = (max_steps - episode_steps) / max_steps * 2.0 — ONLY applied if pop_lost == 0
-   - invalid_action_penalty_total = min(0.2, 0.01 * invalid_action_count) — subtract this
-3. In `env/wildfire_env.py`:
-   - Track `self._prev_action` and `self._invalid_action_count` and `self._crew_casualty_occurred` across the episode (reset them in `reset()`).
-   - Replace the current reward computation in `step()` with: step_reward from above, plus terminal_reward ONLY when `done == True`.
-   - The StepResult.reward should be `step_reward + (terminal if done else 0.0)`.
-4. Keep the OLD composite reward accessible as `info["legacy_reward"]` in StepResult for backward compatibility with existing graders. (Graders get updated in Prompt 10.)
-5. Add `tests/test_reward.py` with:
-   - `test_successful_episode_scores_high` — run heuristic agent on easy tier seed=42, assert total reward > +3.0
-   - `test_all_pop_lost_scores_negative` — construct a scenario (or mock state) where all population is lost, assert terminal < -2.0
-   - `test_crew_casualty_stacks` — scenario with pop loss AND crew casualty, assert terminal includes both penalties
-   - `test_redundant_action_penalty` — call the same DEPLOY_CREW twice, assert second call's step_reward includes -0.1
-Acceptance test:
-- `pytest tests/test_reward.py -v` passes all 4 tests.
-- Run `python scripts/evaluate.py 20` on easy tier with the heuristic agent. Report mean + std of total rewards. Successful episodes should cluster in the +5 to +8 range, failed episodes in the -2 to -5 range. If the ranges overlap by more than 20% of episodes, the reward isn't separated enough — report that and DO NOT proceed.
-```
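A minimal sketch of the decomposed reward from Prompt 2, for orientation only. It is not the repository's `env/reward.py`: `EpisodeSnapshot` is a hypothetical stand-in for whatever state object the env really passes, the unused `action_was_valid` argument is dropped, and crew-casualty and invalid-action counts are passed in explicitly as assumptions.

```python
from dataclasses import dataclass


@dataclass
class EpisodeSnapshot:
    """Hypothetical stand-in for the state fields the prompt refers to."""
    containment: float  # fraction of the fire contained, in [0, 1]
    pop_lost: int
    total_pop: int


def compute_step_reward(prev: EpisodeSnapshot, cur: EpisodeSnapshot,
                        action_was_redundant: bool) -> float:
    """Dense per-step signal: containment gain + safety change - redundancy penalty."""
    delta_containment = cur.containment - prev.containment
    delta_safety = ((1 - cur.pop_lost / cur.total_pop)
                    - (1 - prev.pop_lost / prev.total_pop))
    penalty = -0.1 if action_was_redundant else 0.0
    return 0.4 * delta_containment + 0.4 * delta_safety + penalty


def compute_terminal_reward(final: EpisodeSnapshot, episode_steps: int,
                            max_steps: int, crew_casualty: bool,
                            invalid_action_count: int) -> float:
    """Terminal spike applied once, when the episode ends."""
    reward = 0.0
    if final.pop_lost == 0:
        reward += 5.0
        # Efficiency bonus only applies when nobody was lost.
        reward += (max_steps - episode_steps) / max_steps * 2.0
    else:
        reward -= 3.0 * (final.pop_lost / final.total_pop)
    if crew_casualty:
        reward -= 2.0  # stacks with the population term
    reward -= min(0.2, 0.01 * invalid_action_count)
    return reward
```

Under these rules a clean episode that finishes at the halfway point scores 5.0 + 1.0, while total population loss with a crew casualty scores -5.0 before the invalid-action penalty, which is the separation the acceptance test looks for.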
----
-## Prompt 3 — Observation-to-Text Serializer ✅ DONE
-```
-Write a serializer that converts a Pydantic Observation into a structured text prompt that an LLM can reason over. This is required because OpenEnv is an LLM-training framework — the agent is a language model, not a numeric policy.
-Read `env/models.py` to understand the Observation schema. Read the README section "Observation Space" for the intended structure.
-Tasks:
-1. Create `env/serialization.py` with a function `serialize_observation(obs: Observation, step_num: int, max_steps: int) -> str`.
-2. Output format (match this structure exactly — the LLM will be trained on it):
-```
-=== WILDFIRE INCIDENT COMMAND — STEP {step}/{max_steps} ===
-SITUATION:
-- Fire active on {N} cells. Containment: {pct}%. Population at risk: {N} zones.
-- Wind: {speed} km/h {dir} (±{noise} km/h noise). Humidity: {h}%. Rain: {active|inactive}.
-- Last event: {most_recent_event or "None"}
-GRID SUMMARY (smoke-obscured cells marked [?]):
-{bounding_box_descriptions_of_fire_regions}
-{populated_zone_descriptions}
-{firebreak_descriptions_if_any}
-RESOURCES:
-- crew_0: {deployed at (r,c) | undeployed available}. Status: {active|casualty}.
-- crew_1: ...
-- tanker_0: {ready | cooldown N steps remaining}
-- Firebreaks remaining: {N}. Recon flights remaining: {N}.
-RECENT EVENTS:
-- Step {N}: {event description}
-- ... (last 3 events max)
-Available actions: deploy_crew, move_crew, drop_retardant, build_firebreak, recon_flight, idle
-Produce your action as JSON: {"action_type": "...", ...}
-```
-3. Helper functions inside the module (keep private with leading underscore):
-   - `_summarize_grid_regions(obs.grid) -> List[str]` — detect rectangular bounding boxes of (a) active fire cells clustered together, (b) populated cells, (c) built firebreaks. Output as "Row X-Y, Col A-B: description". Cap at 5 regions per category, prioritize by size.
-   - `_format_resources(obs.resources) -> str`
-   - `_format_events(obs.recent_events) -> str`
-4. Add `tests/test_serialization.py`:
-   - `test_serialize_produces_all_sections` — reset env, serialize, assert the output contains "SITUATION:", "GRID SUMMARY", "RESOURCES:", "RECENT EVENTS:", "Available actions:".
-   - `test_serialize_handles_fog_of_war` — hard tier reset, assert "[?]" appears somewhere in output (smoke or fog-obscured cells).
-   - `test_serialize_length_under_2048_tokens` — run on all 3 tiers, assert `len(tokenizer.encode(output))` < 1800 using tiktoken's cl100k_base (if tiktoken not installed, use `len(text.split()) < 1500` as a proxy).
-Acceptance test:
-- `pytest tests/test_serialization.py -v` passes all 3 tests.
-- Run a manual sanity check: `python -c "from env import WildfireEnv; from env.serialization import serialize_observation; env = WildfireEnv(); obs = env.reset(task_id='medium', seed=42); print(serialize_observation(obs, 0, 150))"` — paste the output and confirm it reads like a realistic incident briefing.
-```
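One way the region summary in Prompt 3 could be computed is a flood fill over a boolean mask, keeping the bounding box of each cluster. The sketch below assumes a plain list-of-lists mask (for example "cell is burning") rather than the real `obs.grid`, and assumes 4-connectivity; `bounding_boxes` and `describe_regions` are illustrative names, not the repository's helpers.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int, int]  # row_min, row_max, col_min, col_max, cell_count


def bounding_boxes(mask: List[List[bool]], cap: int = 5) -> List[Box]:
    """Bounding box of each 4-connected cluster of True cells, largest first."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, r1, c0, c1, count))
    boxes.sort(key=lambda b: b[4], reverse=True)  # prioritize by size
    return boxes[:cap]


def describe_regions(boxes: List[Box], label: str) -> List[str]:
    """Format boxes in the 'Row X-Y, Col A-B: description' style from the prompt."""
    return [f"Row {r0}-{r1}, Col {c0}-{c1}: {label} ({n} cells)"
            for r0, r1, c0, c1, n in boxes]
```

Running the same function once per category (fire, populated, firebreak) with a different mask and label would give the three lists the prompt asks for.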
----
-## Prompt 4 — LLM Action Parser with 3-Layer Fallback ✅ DONE
-```
-Build a robust parser that converts LLM text output into a validated Action object. LLMs produce malformed JSON, hallucinated fields, and out-of-range coords — we need to never crash.
-Tasks:
-1. Create `env/action_parser.py` with a function `parse_action(llm_output: str, obs: Observation) -> Tuple[Action, str]` returning the action AND a status string ("json_success", "regex_fallback", "safe_idle").
-2. Three layers, in order:
-   LAYER 1 — Direct JSON parse:
-   - Extract JSON from output using a helper `_extract_json_block(text)` that finds content between first `{` and matching `}` (handles ```json fences, handles leading/trailing text).
-   - Try `json.loads` then `Action(**data)` — Pydantic validates fields.
-   - On success return (action, "json_success").
-   LAYER 2 — Regex extraction:
-   - Search for action_type via regex: `action_type["\s:]+["']?(deploy_crew|move_crew|drop_retardant|build_firebreak|recon_flight|idle)`
-   - Based on detected action_type, extract required fields with regex patterns (e.g., `crew_id["\s:]+["']?(crew_\d+)`, `target_row["\s:]+(\d+)`, `direction["\s:]+["']?(NE|NW|SE|SW|N|S|E|W)`).
-   - Construct Action; if Pydantic validates, return (action, "regex_fallback").
-   LAYER 3 — Safe fallback:
-   - Return `(Action(action_type=ActionType.IDLE, reason="parse_failure"), "safe_idle")`.
-3. Add coordinate sanity check: after any layer succeeds, if target_row or target_col is outside the current grid dimensions (infer from obs.grid shape), downgrade to safe_idle. Never trust LLM-provided coords blindly.
-4. Add `tests/test_action_parser.py` with 8 test cases covering:
-   - Clean JSON output
-   - JSON wrapped in ```json fences
-   - JSON with extra surrounding commentary
-   - Malformed JSON (missing quotes) that regex can save
-   - Completely garbage output → safe_idle
-   - Out-of-bounds coords → safe_idle
-   - Hallucinated action_type (e.g., "nuke_fire") → safe_idle
-   - Empty string → safe_idle
-Acceptance test:
-- `pytest tests/test_action_parser.py -v` passes all 8 tests.
-- Zero crashes across the test suite.
-- Status string is correctly reported for each case.
-```
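The three-layer fallback in Prompt 4 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's `env/action_parser.py`: actions are plain dicts instead of the Pydantic `Action` model, grid size is passed as `rows` and `cols` instead of being inferred from `obs.grid`, and required-field validation (which Pydantic would do) is omitted. The brace matcher also ignores braces inside string literals.

```python
import json
import re
from typing import Optional, Tuple

VALID_TYPES = ("deploy_crew", "move_crew", "drop_retardant",
               "build_firebreak", "recon_flight", "idle")
TYPE_PATTERN = r'action_type["\s:]+["\']?(' + "|".join(VALID_TYPES) + r")\b"
# Two-letter directions come first so "NE" is not captured as "N".
FIELD_PATTERNS = {
    "crew_id": r'crew_id["\s:]+["\']?(crew_\d+)',
    "target_row": r'target_row["\s:]+(\d+)',
    "target_col": r'target_col["\s:]+(\d+)',
    "direction": r'direction["\s:]+["\']?(NE|NW|SE|SW|N|S|E|W)',
}


def _extract_json_block(text: str) -> Optional[str]:
    """Return the text from the first '{' to its matching '}', if any."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None


def _coords_ok(action: dict, rows: int, cols: int) -> bool:
    """Reject any target coordinate that is not an int inside the grid."""
    for key, limit in (("target_row", rows), ("target_col", cols)):
        value = action.get(key)
        if value is not None and not (isinstance(value, int) and 0 <= value < limit):
            return False
    return True


def parse_action(llm_output: str, rows: int, cols: int) -> Tuple[dict, str]:
    action, status = None, ""
    # Layer 1: direct JSON parse.
    block = _extract_json_block(llm_output)
    if block is not None:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and data.get("action_type") in VALID_TYPES:
            action, status = data, "json_success"
    # Layer 2: regex extraction from loosely structured text.
    if action is None:
        found = re.search(TYPE_PATTERN, llm_output)
        if found:
            action, status = {"action_type": found.group(1)}, "regex_fallback"
            for name, pattern in FIELD_PATTERNS.items():
                match = re.search(pattern, llm_output)
                if match:
                    raw = match.group(1)
                    action[name] = int(raw) if raw.isdigit() else raw
    # Layer 3: safe fallback, also used when coordinates are out of bounds.
    if action is None or not _coords_ok(action, rows, cols):
        return {"action_type": "idle", "reason": "parse_failure"}, "safe_idle"
    return action, status
```

Because every path ends in a returned tuple, the caller never has to wrap `parse_action` in its own try/except, and the status string can be logged to track the parse-failure rate during training.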
----
-## Prompt 5 — Replay / GIF Renderer ✅ DONE
-```
-Build a replay script that renders any episode as an animated GIF. This is critical for the storytelling score — every demo asset depends on it.
-Tasks:
-1. Add `imageio` and `matplotlib` to `requirements.txt` if not present.
-2. Create `scripts/replay.py` with CLI: `python scripts/replay.py --tier {easy|medium|hard} --seed {int} --agent {random|heuristic} --output {path.gif}`.
-3. The script should:
-   - Instantiate the env, run the agent, capture the full ground-truth `env.state()` at every step.
-   - For each step, render a matplotlib figure (8x8 inches, 100 dpi) with:
-     * Main panel (80% area): grid colored by cell state. Burning = red (intensity → color saturation), burned = dark gray, populated = blue square outline, firebreak = brown, crew = green circle with crew_id label, tanker drop zone = translucent cyan overlay.
-     * Bottom strip: step number, cells burning, containment %, pop lost, wind arrow + speed.
-   - Save all frames, stitch to GIF at 5 fps, write to output path.
-   - Also save final-frame PNG to same path with `.png` extension.
-4. Keep the rendering code in `env/rendering.py` (importable helpers), not inline in the script. Functions:
-   - `render_frame(state: EnvState, step: int, stats: dict) -> np.ndarray` — returns RGB array.
-   - `render_episode_gif(frames: List[np.ndarray], output_path: str, fps: int = 5)`.
-5. Add `tests/test_rendering.py`:
-   - `test_render_frame_produces_rgb` — reset env on easy, render frame, assert shape is (H, W, 3) and dtype is uint8.
-   - `test_gif_creation` — run 20 steps of random agent, call `render_episode_gif`, assert output file exists and is > 10KB.
-Acceptance test:
-- `pytest tests/test_rendering.py -v` passes both tests.
-- Run: `python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/heuristic_medium_42.gif`
-- Open the GIF. Confirm it shows fire spreading, crews moving, and the stats strip updating. Paste the final-frame stats as confirmation.
-```
----
-## Prompt 6 — Curriculum Controller ✅ DONE
-```
-Add a curriculum controller that auto-promotes through tiers based on rolling performance. This produces the characteristic "dip-and-recover" pattern on training curves that makes for compelling demo visuals.
-Tasks:
-1. Create `env/curriculum.py` with class `CurriculumController`:
-   - `__init__(self, start_tier: str = "easy", thresholds: Optional[dict] = None)` — default thresholds: easy→medium at 4.0 avg over 10 eps, medium→hard at 3.5 avg over 10 eps (these are total episode rewards under the new reward scheme, NOT [0,1]).
-   - `after_episode(self, total_reward: float) -> Optional[str]` — returns the new tier name if a promotion just fired, else None.
-   - `get_tier(self) -> str` — current tier.
-   - `get_history(self) -> List[Tuple[int, str, float]]` — list of (episode_idx, tier, reward) for plotting.
-   - `promotion_log: List[Tuple[int, str]]` — list of (episode_idx, new_tier) for marking vertical lines on plots.
-2. Demote behavior: if recent 10-ep avg drops below (threshold * 0.5) after a promotion, demote back. Log this too.
-3. Add `tests/test_curriculum.py`:
-   - `test_promotion_fires_at_threshold` — feed 10 rewards of 5.0, assert promotion to medium.
-   - `test_no_premature_promotion` — feed 5 rewards of 5.0, assert still on easy.
-   - `test_demotion_on_collapse` — promote to medium, then feed 10 rewards of 0.5, assert demoted to easy.
-   - `test_history_tracking` — run 20 episodes, assert history length is 20 and promotion_log is correctly populated.
-Acceptance test:
-- `pytest tests/test_curriculum.py -v` passes all 4 tests.
-- The controller is not yet wired into the env itself (that happens in the training notebook, Prompt 8). This prompt just builds the component.
-```
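A compact sketch of the controller described in Prompt 6. Several details are assumptions the prompt leaves open: the demotion check uses half of the threshold that was crossed to enter the current tier, the rolling window is cleared on every tier switch, and `after_episode` returns the new tier on demotion as well as promotion. The `window` parameter and the `_switch` helper are hypothetical additions.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

TIERS = ["easy", "medium", "hard"]


class CurriculumController:
    def __init__(self, start_tier: str = "easy",
                 thresholds: Optional[Dict[str, float]] = None,
                 window: int = 10) -> None:
        # Threshold keyed by the tier being left: easy->medium 4.0, medium->hard 3.5.
        self.thresholds = thresholds or {"easy": 4.0, "medium": 3.5}
        self.window = window
        self._tier = start_tier
        self._recent: Deque[float] = deque(maxlen=window)
        self._history: List[Tuple[int, str, float]] = []
        self.promotion_log: List[Tuple[int, str]] = []

    def get_tier(self) -> str:
        return self._tier

    def get_history(self) -> List[Tuple[int, str, float]]:
        return list(self._history)

    def after_episode(self, total_reward: float) -> Optional[str]:
        episode_idx = len(self._history)
        self._history.append((episode_idx, self._tier, total_reward))
        self._recent.append(total_reward)
        if len(self._recent) < self.window:
            return None  # not enough episodes on this tier yet
        avg = sum(self._recent) / self.window
        idx = TIERS.index(self._tier)
        promote_at = self.thresholds.get(self._tier)
        if promote_at is not None and idx + 1 < len(TIERS) and avg >= promote_at:
            return self._switch(TIERS[idx + 1], episode_idx)
        if idx > 0 and avg < 0.5 * self.thresholds[TIERS[idx - 1]]:
            return self._switch(TIERS[idx - 1], episode_idx)
        return None

    def _switch(self, new_tier: str, episode_idx: int) -> str:
        self._tier = new_tier
        self._recent.clear()  # rewards from the old tier should not count
        self.promotion_log.append((episode_idx, new_tier))
        return new_tier
```

Clearing the window on a switch means a tier change can fire at most once every `window` episodes, which keeps the controller from oscillating between tiers on noisy rewards.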
----
-## Prompt 7 — Eval Comparison Script ✅ DONE
-```
-Build the eval comparison script that generates the headline comparison table for the pitch. This runs multiple agents on fixed seeds and outputs a clean comparison.
-Tasks:
-1. Create `scripts/eval_compare.py` with CLI: `python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic base_llm trained_llm --output eval_results.json`.
-2. Agent registry — a dict mapping agent name to a factory function:
-   - `random` → existing RandomAgent
-   - `heuristic` → existing HeuristicAgent
-   - `base_llm` → LLM agent using `env/serialization.py` + `env/action_parser.py`, calling a base model (stub this for now — read model path from env var `BASE_MODEL_PATH`, default to None which skips this agent with a warning).
-   - `trained_llm` → same pattern, env var `TRAINED_MODEL_PATH`.
-3. For each (agent, tier, seed) combination:
-   - Run the episode.
-   - Record: final containment_pct, pop_saved_pct (= 1 - pop_lost/total_pop), total_reward, episode_steps.
-4. Output:
-   - A JSON file at the specified path with full results.
-   - A printed table to stdout formatted like:
-     ```
-     === EVAL RESULTS — Medium Tier (5 seeds) ===
-                         Containment   Pop Saved   Reward   Steps
-     Random Agent            41%          60%       -1.2     150
-     Heuristic Agent         49%          71%       +1.8     143
-     Base LLM (Qwen)         38%          55%       -0.9     150  [skipped — no model]
-     Trained LLM (ours)      67%          89%       +4.1     121  [skipped — no model]
-     ```
-   - Use mean across seeds for each column. Mark skipped agents clearly.
-5. Add `--quick` flag that runs only easy tier with 2 seeds for smoke testing.
-6. Add `tests/test_eval_compare.py`:
-   - `test_quick_mode_runs` — invoke with --quick, assert eval_results.json exists, assert at least random and heuristic have non-null entries.
-Acceptance test:
-- `python scripts/eval_compare.py --quick` completes in under 2 minutes.
-- `pytest tests/test_eval_compare.py -v` passes.
-- Full run `python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic` produces the table. The LLM columns will show "[skipped — no model]" which is expected at this stage.
-```
----
-## Prompt 8 — GRPO Training Notebook (Colab) ✅ DONE
-```
-Build the GRPO training notebook. This is a hackathon minimum requirement — without it we're technically DQ'd.
-Tasks:
-1. Create `training/grpo_colab.ipynb` (a Jupyter notebook — JSON format). Use `nbformat` to construct it programmatically to avoid JSON escaping errors.
-2. Notebook sections (each a separate cell with a markdown header cell above it):
-   **Section 1: Setup**
-   - pip install: `unsloth trl openenv-core pydantic numpy imageio matplotlib`
-   - Clone the repo or install from path.
-   - Import FastLanguageModel from unsloth, load `unsloth/Qwen2.5-1.5B-Instruct` in 4-bit with max_seq_length=2048.
-   - Apply LoRA: r=16, alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"].
-   **Section 2: Environment & Rollout**
-   - Import WildfireEnv, serialize_observation, parse_action.
-   - Define `collect_rollout(env, model, tokenizer, tier, seed) -> List[Dict]` that:
-     * resets env
-     * for each step: serializes obs → generates completion → parses action → steps env → records (prompt, completion, reward, step_status).
-     * returns trajectory list.
-   - Define `system_prompt` — a short, firm instruction to always output action as JSON only.
-   **Section 3: GRPO Training Loop**
-   - Use TRL's GRPOTrainer. Config: num_generations=8 per prompt, learning_rate=5e-6, max_steps=50, save_steps=10, per_device_train_batch_size=1, gradient_accumulation_steps=4.
-   - Reward function: for each generation, run a mini-rollout (fresh env, same seed, sample action from the completion) and return the 1-step reward + discounted terminal if done. Cache seeds so generations for the same prompt see the same env state.
-   - Wire in the CurriculumController from Prompt 6: after each full episode, call `controller.after_episode(total_reward)` and switch tier for the next episode.
-   **Section 4: Checkpointing & Recovery**
-   - Save LoRA adapter to `./checkpoints/step_{N}` every 10 steps.
-   - Save a JSON of training stats (step, mean_reward, tier, parse_failure_rate) to `./training_stats.json` every step.
-   - Add a "resume from checkpoint" cell at the top of Section 3 that loads the latest checkpoint if present.
-   **Section 5: Plot Reward Curve**
-   - Load training_stats.json, plot mean_reward vs step with matplotlib.
-   - Save as `reward_curve.png`.
-   - Mark tier promotions as vertical lines using controller.promotion_log.
-3. Add `training/README.md` with:
-   - How to open in Colab (a badge link).
-   - Which cells to run in order.
-   - Expected runtime on T4: ~45 min for 50 steps.
-   - How to download the trained adapter.
-4. Add `training/test_notebook_imports.py` — a plain Python file (not pytest) that imports every module the notebook uses and instantiates the env + tokenizer (skipping the model load). This catches broken imports before you open Colab.
-Acceptance test:
-- `python training/test_notebook_imports.py` runs without error.
-- Open the notebook locally with `jupyter nbconvert --to notebook --execute training/grpo_colab.ipynb --ExecutePreprocessor.timeout=600` — skip this if no GPU available locally (which is expected). Instead, validate notebook JSON with `jupyter nbconvert --to script training/grpo_colab.ipynb` and confirm the generated .py file has no syntax errors.
-- Confirm the notebook has exactly the 5 sections described, each with a markdown header.
-```
----
-## Prompt 9 — Training Curves Dashboard ✅ DONE
-```
-Build a 4-panel training dashboard. Panel D (curriculum transitions) is the storytelling hook.
-Tasks:
-1. Create `scripts/plot_dashboard.py` with CLI: `python scripts/plot_dashboard.py --stats training/training_stats.json --output training/training_dashboard.png`.
-2. Layout: 2x2 matplotlib grid, figsize=(12, 8), dpi=100.
-   - Panel A (top-left): Mean episode reward vs training step. Line plot with moving average (window=5) as a thicker overlay.
-   - Panel B (top-right): Population survival rate (% of eps with zero pop loss) vs training step. Computed as rolling 10-ep fraction.
-   - Panel C (bottom-left): Mean containment % at episode end, vs training step.
-   - Panel D (bottom-right): Curriculum tier timeline. X-axis = episode index, Y-axis = tier (easy=0, medium=1, hard=2) drawn as a step function. Vertical dashed lines at promotion events with tier labels.
-3. Handle missing data gracefully — if `training_stats.json` is absent, generate a synthetic stats file at `training/synthetic_stats_demo.json` with 50 fake training steps showing a plausible upward curve + one tier promotion, then plot from that (clearly label it "SYNTHETIC DEMO" in the figure title). This lets us test the plot script without a real training run.
-4. Add `tests/test_dashboard.py`:
-   - `test_synthetic_dashboard` — run `plot_dashboard.py` with no stats file, assert the synthetic PNG is created and > 50KB.
-Acceptance test:
-- `pytest tests/test_dashboard.py -v` passes.
-- Open the generated PNG. Confirm all 4 panels render, Panel D has visible vertical promotion lines, and the synthetic warning label is visible.
-```
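Panels A and B in Prompt 9 both rely on a trailing window statistic. The sketch below shows one way to compute them without extra dependencies; `moving_average` and `rolling_survival_rate` are illustrative names, and averaging over a shorter window for the first few points is an assumption about how the curve should start.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Trailing mean; early points average over however many values exist so far."""
    out: List[float] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out


def rolling_survival_rate(pop_lost_per_episode: Sequence[int],
                          window: int = 10) -> List[float]:
    """Fraction of the last `window` episodes that ended with zero population loss."""
    flags = [1.0 if lost == 0 else 0.0 for lost in pop_lost_per_episode]
    return moving_average(flags, window)
```

The returned lists have the same length as the input, so they can be plotted against the same step or episode axis as the raw series.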
----
-## Prompt 10 — Grader Alignment & Legacy Reward Cleanup ✅ DONE
-```
-The existing graders in graders/ were written against the old [0,1] composite reward. Align them with the new decomposed reward so eval numbers are consistent between training and grading.
-Tasks:
-1. Read `graders/grader_easy.py`, `graders/grader_medium.py`, `graders/grader_hard.py`. Identify where each one reads `result.reward` or computes a final score.
-2. Update each grader:
-   - Sum step rewards + terminal reward across the episode using the new decomposed structure.
-   - Return the total episode reward as the grader's score.
-   - Add a `details` dict to the grader return value: `{"total_reward": float, "containment_pct": float, "pop_saved_pct": float, "steps": int, "crew_casualty": bool}`.
-3. Remove any references to `legacy_reward` that are no longer needed. Keep `legacy_reward` in StepResult.info for one more cycle (delete later), but graders should NOT use it.
-4. Update `scripts/evaluate.py` to print the new detailed metrics alongside the reward.
-5. Update `README.md` "Baseline Scores" table with the new reward scale. Re-run `python scripts/evaluate.py 5` and paste the new numbers. Expected pattern: heuristic should now clearly beat random on ALL tiers under the new reward. If it doesn't, flag it — this is a diagnostic signal that the reward or the heuristic needs work.
-6. Add `tests/test_graders.py`:
-   - `test_each_grader_returns_float_and_details` — run each of the 3 graders with the heuristic agent, assert return structure.
-   - `test_grader_scores_are_in_expected_range` — assert easy total_reward > 3.0 for heuristic, medium > 1.0, hard > 0.0 (generous lower bounds).
-Acceptance test:
-- `pytest tests/test_graders.py -v` passes.
-- `python scripts/evaluate.py 5` produces a table where heuristic beats random on every tier. Paste the output. If heuristic loses on any tier, investigate before proceeding — this likely indicates a variance or reward issue.
-- README "Baseline Scores" section is updated with new numbers.
-```
----
-## Prompt 11 — Demo Seed Finder + Demo Runner ✅ DONE
-```
-Find a fixed seed that produces a clean, visually obvious contrast between heuristic and (eventually) trained-LLM behavior on medium tier. This becomes the 3-minute pitch demo.
-Tasks:
-1. Create `scripts/find_demo_seed.py`:
-   - Iterate seeds 0..500 on medium tier.
-   - For each seed, run the HEURISTIC agent, record: total_reward, pop_saved_pct, wind_shift_step (if any), and whether at least one populated cell was lost.
-   - Filter for seeds where: (a) a wind shift fires between step 60–90, (b) heuristic loses at least one populated cell, (c) heuristic total_reward is between 0.0 and +2.0 (i.e., a flawed but not catastrophic baseline that leaves room for improvement).
-   - Output top 5 candidate seeds to `demos/candidate_seeds.json` with a short description of each.
-2. Create `scripts/run_demo.py` with CLI: `python scripts/run_demo.py --seed {int}`:
-   - Runs heuristic on medium tier with that seed, generates GIF to `demos/heuristic_demo.gif` using the Prompt 5 renderer.
-   - Prints a play-by-play narrative: "Step 45: fire approaches populated cell (12, 8). Step 60: wind shifts. Step 75: crew committed to wrong flank. Step 89: populated cell burns."
-   - If `--agent trained_llm` is passed and a TRAINED_MODEL_PATH env var exists, also runs the trained model and saves `demos/trained_demo.gif` + a second narrative.
-   - Print a clean side-by-side comparison at the end: both agents' final stats.
-3. Pick ONE seed from the top 5 as `DEMO_SEED`. Hardcode it as a constant in `scripts/run_demo.py`: `DEMO_SEED = <chosen_seed>`. The `--seed` flag defaults to this. Document the narrative for this specific seed in a comment block at the top of the file.
-4. Add `demos/README.md` explaining how to regenerate demo assets.
-Acceptance test:
-- `python scripts/find_demo_seed.py` completes in under 10 minutes, outputs candidate_seeds.json.
-- `python scripts/run_demo.py` (with default seed) produces heuristic_demo.gif and prints a coherent narrative. Confirm the narrative matches what the GIF actually shows.
-- Paste the chosen DEMO_SEED value.
-```
----
-## Prompt 12 — Theme 2 Framing: Operational Briefing System
-```
-Add a structured operational briefing that the env produces on reset(). The agent receives this as part of its first observation. This is what pivots the environment into Theme 2 (Long-Horizon Planning & Instruction Following) framing — judges need to see instruction-following as a first-class feature.
-Tasks:
-1. Create `env/briefing.py` with:
-   - Pydantic model `OperationalBriefing` with fields: `incident_id: str`, `ignition_cause: str`, `priority_populated_zones: List[Tuple[int, int]]` (cells the agent must prioritize protecting), `priority_infrastructure: List[Tuple[int, int]]` (e.g., road cells, optional), `forecast_events: List[str]` (e.g., "Wind shift southwest expected by step 60"), `declared_time: str` (narrative time like "04:00").
-   - Function `generate_briefing(tier_config, rng) -> OperationalBriefing` that synthesizes a plausible briefing from the tier config. For populated priorities, pick the top 2 largest pop clusters. For forecast events, derive from the weather schedule if the engine exposes scheduled wind shifts; otherwise generate 1-2 plausible generic forecasts.
-   - Function `briefing_to_text(briefing: OperationalBriefing) -> str` — formats as a natural-language briefing block:
-     ```
-     === OPERATIONAL BRIEFING ===
-     Incident {incident_id} declared at {declared_time}.
-     Cause: {ignition_cause}.
-     PRIORITY 1: Protect populated zones at {coords list with cell names}.
-     PRIORITY 2: Maintain {infrastructure} open where possible.
-     FORECAST:
-     - {forecast_1}
-     - {forecast_2}
-     Commander's intent: Contain fire with zero civilian casualties. Preserve crew safety.
-     ```
-2. Update `env/models.py`:
-   - Add `briefing: Optional[OperationalBriefing]` field to `Observation`. Populated only on the first observation after reset; subsequent observations can reuse or omit.
-3. Update `env/wildfire_env.py`:
-   - On reset, generate a briefing and attach to the first observation.
-   - Store `self.active_briefing` for the episode so reward logic can reference it.
-4. Update `env/reward.py` compute_terminal_reward:
-   - Add a `briefing_adherence_bonus`: +1.0 if all priority_populated_zones survived, 0 otherwise.
-   - Stack this on top of the existing terminal reward.
-5. Update `env/serialization.py` serialize_observation:
-   - If `obs.briefing` is present, prepend `briefing_to_text(obs.briefing)` above the SITUATION block.
-   - Subsequent steps: include a shortened reminder like "Priority zones: (r1,c1), (r2,c2) — still standing" or "— 1 LOST".
-6. Add `tests/test_briefing.py`:
-   - `test_briefing_generated_on_reset` — reset on medium, assert obs.briefing is not None and has ≥1 priority zone.
-   - `test_briefing_adherence_bonus` — run heuristic successfully saving priority zones, assert terminal includes the +1.0.
-   - `test_briefing_in_serialized_prompt` — serialize first obs, assert "OPERATIONAL BRIEFING" substring is present.
-Acceptance test:
-- `pytest tests/test_briefing.py -v` passes all 3 tests.
-- Run the serializer manually on a fresh medium reset and confirm the briefing reads coherently. Paste the output.
-- Re-run `python scripts/evaluate.py 5`. Reward numbers will shift slightly due to the new bonus — that's expected. Paste the new numbers.
-```
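To make the briefing format in Prompt 12 concrete, here is a small sketch that uses a dataclass as a stand-in for the Pydantic model. How coordinates are rendered, the wording "infrastructure at ...", and omitting the PRIORITY 2 and FORECAST lines when their lists are empty are all assumptions; `briefing_adherence_bonus` takes a hypothetical set of burned cells rather than the real env state.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Cell = Tuple[int, int]


@dataclass
class OperationalBriefing:
    """Dataclass stand-in for the Pydantic model named in the prompt."""
    incident_id: str
    ignition_cause: str
    priority_populated_zones: List[Cell]
    priority_infrastructure: List[Cell] = field(default_factory=list)
    forecast_events: List[str] = field(default_factory=list)
    declared_time: str = "04:00"


def _coords(cells: List[Cell]) -> str:
    return ", ".join(f"({r},{c})" for r, c in cells)


def briefing_to_text(b: OperationalBriefing) -> str:
    lines = [
        "=== OPERATIONAL BRIEFING ===",
        f"Incident {b.incident_id} declared at {b.declared_time}.",
        f"Cause: {b.ignition_cause}.",
        f"PRIORITY 1: Protect populated zones at {_coords(b.priority_populated_zones)}.",
    ]
    if b.priority_infrastructure:
        lines.append("PRIORITY 2: Maintain infrastructure at "
                     f"{_coords(b.priority_infrastructure)} open where possible.")
    if b.forecast_events:
        lines.append("FORECAST:")
        lines.extend(f"- {event}" for event in b.forecast_events)
    lines.append("Commander's intent: Contain fire with zero civilian casualties. "
                 "Preserve crew safety.")
    return "\n".join(lines)


def briefing_adherence_bonus(b: OperationalBriefing, burned_cells: Set[Cell]) -> float:
    """+1.0 only if every priority populated zone survived the episode."""
    survived = all(zone not in burned_cells for zone in b.priority_populated_zones)
    return 1.0 if survived else 0.0
```

Keeping the bonus all-or-nothing matches the prompt's wording; a per-zone partial credit would be a different design and would change the reward scale.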
----
-## Prompt 13 — README Rewrite for Finale Framing
-```
-Rewrite the README to frame this as a finale submission aligned with Theme 2 (Long-Horizon Planning & Instruction Following). Keep all the technical depth but re-lead with the finale narrative.
-Tasks:
-1. Replace the current README.md top section (above "Real-World Motivation") with:
-```markdown
-# Wildfire Containment Simulator
-**OpenEnv Finale Submission — Theme 2: Long-Horizon Planning & Instruction Following**
-![Training Demo](demos/heuristic_demo.gif)
-A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, tracking state across 300-step episodes, and recovering from cascading failures. Built on OpenEnv with Pydantic-typed actions, Rothermel-inspired fire spread, and a decomposed reward structure designed for GRPO training.
-**Headline result:** Our trained Qwen-2.5-1.5B IC achieves {X}% population survival on Hard tier vs. {Y}% for the rule-based heuristic baseline. See [HF blog post]({link}) for details.
-## Quick Links
-- 🔥 **HF Space (live env):** {link}
-- 📒 **Training notebook (Colab):** [training/grpo_colab.ipynb]({link})
-- 📊 **Eval results:** [eval_results.json]({link})
-- 🎬 **Demo:** `python scripts/run_demo.py`
-- 📝 **Blog post:** {link}
-```
-2. Add a new section right after the quick links called **"Why Theme 2"**:
-   - 3 bullets explaining long-horizon planning (300 steps, sparse terminal reward), instruction following (operational briefings), and recovery from early mistakes (staggered ignitions, crew loss events).
-3. Keep all existing sections (Environment API, Action Space, Observation Space, Reward Function, Tiers, Fire Spread Model, Project Structure, Key Design Decisions).
-4. Update the **Reward Function** section to describe the new decomposed structure (step rewards + terminal spikes), not the old [0,1] composite.
-5. Add a new **"Baseline Scores"** table with post-training numbers. If training hasn't completed yet, use placeholder `{TBD}` and add a prominent note: "Numbers will be updated post-training on April 24."
-6. Add a **"Reproducing Our Results"** section:
-   - How to run baseline evals.
-   - How to open the Colab notebook.
-   - How to run the demo seed.
-   - How to render replays.
-Acceptance test:
-- README renders cleanly on GitHub (preview via VSCode or `grip`).
-- All links are either live or clearly marked as placeholders.
-- The first screenful (hero + quick links + theme justification) is self-contained — a judge can get the pitch in 30 seconds without scrolling.
-```
----
-## Prompt 14 — CI & Final Repo Polish
-```
-Add CI and final-mile polish. This is the "looks professional on GitHub" pass.
-Tasks:
-1. Create `.github/workflows/ci.yml`:
-   - Triggers: push to main, PRs.
-   - Runs: setup Python 3.10, install requirements, run `pytest tests/ -v --cov=env --cov-report=term`.
-   - Cache pip dependencies.
-   - Required checks: all tests pass.
-2. Add a coverage badge and CI badge to the top of README (below the title):
-   ```
-   ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
-   ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
-   ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
-   ```
-3. Create `LICENSE` file with MIT license (to match the README frontmatter).
-4. Audit `openenv.yaml` against the latest OpenEnv spec — fetch the latest spec from the openenv repo (github.com/meta-pytorch/openenv if that's the canonical URL at the time of writing) and verify field names, required properties, and schema version. Report any discrepancies and fix them.
-5. Clean up `pyproject.toml`:
-   - Pin Python to `>=3.10`.
-   - Ensure all console_scripts point to existing entry points (no dead references).
-   - Move `pytest`, `pytest-cov` to `[project.optional-dependencies]` under a `dev` extra.
-6. Add `CONTRIBUTING.md` (brief — 15 lines is fine) explaining how to add a new tier, how to add a new action type, and where tests live.
-7. Run `python -c "import env; import server; from env.wildfire_env import WildfireEnv; WildfireEnv().reset(task_id='easy', seed=0)"` as a final smoke test.
-Acceptance test:
-- CI badge appears (may show pending until the first push).
-- `pytest tests/ -v --cov=env` runs clean locally and reports >60% coverage on env/.
-- OpenEnv spec audit is completed — paste any discrepancies found and confirm they're fixed.
-- Repo root looks clean: no `__pycache__`, no `{env,...}` artifacts, no nested duplicate folder.
-```
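For the `pyproject.toml` cleanup step ("no dead references"), one way to test a console-script target is to import it. This is a minimal sketch assuming targets use the standard `module:attr` form; `entry_point_exists` is a hypothetical helper, not something in the repo.

```python
import importlib


def entry_point_exists(spec: str) -> bool:
    """Return True if a 'module:attr' console-script target resolves."""
    module_name, _, attr_path = spec.partition(":")
    try:
        obj = importlib.import_module(module_name)
    except (ImportError, ValueError):
        # ImportError: module is missing; ValueError: empty module name.
        return False
    # Walk dotted attributes, e.g. "Class.method"; an empty path checks the module only.
    for name in filter(None, attr_path.split(".")):
        if not hasattr(obj, name):
            return False
        obj = getattr(obj, name)
    return True
```

Pass each value from the `[project.scripts]` table to the helper; any target that returns `False` is a dead reference to remove or repair.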
----
-## Prompt 15 (OPTIONAL — Only if P1 complete by April 24 evening) — Multi-Agent Crew Architecture
-```
-OPTIONAL: Only execute this if Prompts 1-14 are complete AND the training run has produced a working reward curve. Otherwise skip — this is a high-risk refactor close to deadline.
-Convert crews from passive tools into semi-autonomous sub-agents. This legitimizes the Halluminate sub-theme claim (Theme 1: Multi-Actor Environments) as a secondary pitch angle.
-Tasks:
-1. Add `local_observation` method to Crew in `env/resources.py`:
-   - Returns a 3×3 neighborhood view centered on the crew's position (fire_state, intensity, smoke), plus crew's own health state.
-2. Add a `local_policy` function per crew:
-   - Rule-based: if intensity at current cell > 0.8, retreat one cell away from fire center. Otherwise move toward nearest visible fire in the 3×3 window. If no fire visible, hold position.
-   - Crews execute this policy automatically each step UNLESS the IC's most recent order overrides.
-3. Change IC action space:
-   - Keep existing `MOVE_CREW(crew_id, direction)` but re-label semantically as `ORDER_CREW_MOVE`.
-   - Add `ORDER_CREW_OBJECTIVE(crew_id, objective: Literal["hold", "advance", "retreat", "prioritize_north", "prioritize_south", "prioritize_east", "prioritize_west"])` — the crew's local policy then biases toward that objective.
-   - If the IC issues no order in a given step, crews follow their local_policy autonomously.
-4. Reward impact:
-   - Add tracking for "autonomous saves" — when a crew retreats on its own local_policy and avoids a casualty that would have otherwise occurred. Log these; they become a talking point ("our crews saved themselves 3 times in this episode without IC instruction").
-5. Add `tests/test_multi_agent.py`:
-   - `test_crew_retreats_from_high_intensity` — construct scenario with intensity spike at crew's cell, assert crew moves away next step even with no IC order.
-   - `test_ic_order_overrides_local_policy` — assert `ORDER_CREW_MOVE` still works when issued.
-   - `test_autonomous_save_tracking` — count autonomous_saves after a scripted scenario.
-6. Update `env/serialization.py` to include crew local observations in the prompt under a new `CREW REPORTS` section (each crew reports what they see and what they're doing).
-7. Update README to add a "Multi-Agent Architecture" section describing the IC/crew decomposition.
-Acceptance test:
-- `pytest tests/test_multi_agent.py -v` passes all 3 tests.
-- Run `python scripts/run_demo.py` — confirm the narrative now includes autonomous crew moments.
-- If ANYTHING breaks the existing test suite, revert the changes immediately. This prompt must not destabilize P1 deliverables.
-```
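The rule-based crew policy described in Prompt 15 can be sketched in a few lines. Everything here is an illustrative assumption, not the repo's code: the window is a plain 3×3 list of intensities with the crew at the center, "retreat away from fire center" is approximated as stepping to the lowest-intensity neighbor, and since every neighbor in a 3×3 window is equally near, "nearest visible fire" is taken as the first burning neighbor in row-major order.

```python
from typing import List, Optional, Tuple

Move = Tuple[int, int]  # (d_row, d_col), each in {-1, 0, 1}
HOLD: Move = (0, 0)
RETREAT_THRESHOLD = 0.8  # from the prompt: retreat when own-cell intensity exceeds this

# All window cells except the crew's own cell at (1, 1), in row-major order.
NEIGHBORS = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]


def local_policy(window: List[List[float]], ic_order: Optional[Move] = None) -> Move:
    """Pick a crew move from a 3x3 intensity window (hypothetical sketch)."""
    if ic_order is not None:
        return ic_order  # the IC's most recent order overrides the local policy
    if window[1][1] > RETREAT_THRESHOLD:
        # Assumption: approximate "away from fire" as the coolest neighboring cell.
        r, c = min(NEIGHBORS, key=lambda rc: window[rc[0]][rc[1]])
        return (r - 1, c - 1)
    for r, c in NEIGHBORS:
        if window[r][c] > 0:
            return (r - 1, c - 1)  # step toward the first visible burning neighbor
    return HOLD  # no fire visible in the window
```

Keeping the IC override as the first branch mirrors the three tests listed above: a retreat with no order, an order that still wins, and a hold when nothing is burning. Counting "autonomous saves" would need the environment's casualty logic and is left out of this sketch.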
----
-## Final Checklist (Run Before Submission)
-Run these commands sequentially. All must pass.
-```bash
-# 1. All tests green
-pytest tests/ -v
-# 2. Baseline eval produces expected pattern
-python scripts/evaluate.py 5
-# 3. Eval comparison runs
-python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic
-# 4. Demo runs cleanly
-python scripts/run_demo.py
-# 5. Dashboard generates
-python scripts/plot_dashboard.py --stats training/training_stats.json --output training/training_dashboard.png
-# 6. Replay generates
-python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/heuristic_medium_42.gif
-# 7. Notebook imports work
-python training/test_notebook_imports.py
-# 8. Env still serves
-python app.py &
-sleep 3
-curl --fail http://localhost:7860/health
-kill %1
-```
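The fixed `sleep 3` in step 8 can race a slow server start. A polling check is one alternative. This is a sketch only: `wait_for_health` is a hypothetical helper, and it assumes the `/health` endpoint answers with HTTP 200 once the server is ready.

```python
import time
import urllib.request


def wait_for_health(url: str, timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP error statuses
            # (urllib's URLError and HTTPError both derive from OSError).
            pass
        time.sleep(interval)
    return False
```

Called as `wait_for_health("http://localhost:7860/health")` after launching the server, it returns as soon as the endpoint responds instead of waiting a fixed three seconds.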