Eshit committed on
Commit
5d6ff6f
·
1 Parent(s): 66a57c6

docs: split README and BLOG; remove private guides from tracking

Browse files

- README: judge-facing overview, links, API, results placeholders

- BLOG: long-form training narrative and v1 post-mortem

- Untrack AGENTS.md, CLAUDE.md, prompts.md, Meta OpenEnv PDF; expand .gitignore

Made-with: Cursor

.gitignore CHANGED
@@ -72,5 +72,14 @@ desktop.ini
72
  # Private project notes (not for judges / public)
73
  HACKATHON_ALIGNMENT.md
74
  colab_prompts.md
75
  private_notes/
76
  .private/
 
72
  # Private project notes (not for judges / public)
73
  HACKATHON_ALIGNMENT.md
74
  colab_prompts.md
75
+ prompts.md
76
+ implementation_plan.md
77
+ Summary.txt
78
+ yt.txt
79
+ CLAUDE.md
80
+ AGENTS.md
81
+ # Meta / hackathon reference PDFs (keep out of Hub for size + IP)
82
+ *OpenEnv*Hackathon*Participant*Help*Guide*.pdf
83
+ *External*Meta*OpenEnv*Hackathon*.pdf
84
  private_notes/
85
  .private/
AGENTS.md DELETED
@@ -1,24 +0,0 @@
1
- # Repository Guidelines
2
-
3
- ## Project Structure & Module Organization
4
- Core simulation code lives in `env/`, including fire spread, weather, rewards, rendering, serialization, and the main `WildfireEnv`. Baseline agents are in `agents/`, difficulty graders in `graders/`, and HTTP serving code in `server/` with the entrypoint at `server/app.py`. Utility scripts such as evaluation, replay, and demo generation live in `scripts/`. Tests are centralized in `tests/`, training material is under `training/`, and generated media belongs in `demos/`.
5
-
6
- ## Build, Test, and Development Commands
7
- Install dependencies with `uv pip install -r requirements.txt` and `uv pip install -e .`. Run the test suite with `pytest tests -v` or include coverage via `pytest tests -v --cov=env`. Start the local API with `python app.py` or `python -m server.app`; both serve FastAPI on port `7860`. Common workflows:
8
-
9
- - `python scripts/evaluate.py 5` runs baseline evaluation across tiers.
10
- - `python scripts/eval_compare.py --seeds 42 43 44 --tiers medium hard --agents random heuristic` compares agents.
11
- - `python scripts/run_demo.py` generates the demo GIF.
12
- - `python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/replay.gif` replays one episode.
13
-
14
- ## Coding Style & Naming Conventions
15
- Follow existing Python style: 4-space indentation, `snake_case` for functions/modules, `PascalCase` for Pydantic models and classes, and descriptive enum names such as `ActionType.DEPLOY_CREW`. Keep validation close to models in `env/models.py` and environment execution logic in `env/wildfire_env.py`. No formatter config is checked in, so preserve the surrounding style and keep imports straightforward.
16
-
17
- ## Testing Guidelines
18
- Use `pytest`; test discovery is configured in `pyproject.toml` to read from `tests/`. Name files `test_<feature>.py` and add focused cases near related coverage, for example parser changes in `tests/test_action_parser.py`. For new actions or tiers, add both behavioral tests and at least one regression test for invalid or edge-case inputs.
19
-
20
- ## Commit & Pull Request Guidelines
21
- This workspace does not include `.git`, so repository history is not available for direct inspection. Use short, imperative commit subjects such as `Add hard-tier recon regression tests`. In pull requests, include a concise summary, list affected modules, note test commands run, and attach screenshots or GIFs when changing rendering, replay, or demo output.
22
-
23
- ## Configuration & Contribution Notes
24
- Update `openenv.yaml` when adding tasks, and keep grader/task IDs aligned with `WildfireEnv.TIER_MAP`. When adding a new action, update `env/models.py`, `env/wildfire_env.py`, `env/action_parser.py`, and the corresponding tests together to avoid contract drift.
BLOG.md ADDED
@@ -0,0 +1,246 @@
1
+ # Teaching a 1.5B-class Language Model to Fight Wildfires with GRPO
2
+
3
+ *A frank write-up of what we built, what worked, and what broke — for the Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following.*
4
+
5
+ > **TL;DR.** We built a partially-observable wildfire-response RL environment on OpenEnv, generated 4,300 supervised examples from a hand-coded heuristic, did a 1-epoch SFT warm-up on Qwen-2.5-7B-Instruct, then ran GRPO with a curriculum that auto-promotes the agent across three difficulty tiers. The trained agent reaches **{TBD}** mean reward on Hard tier (heuristic baseline: +4.74; random: +2.16). Code, env, training notebooks, and a live HF Space are all linked from the [`README`](README.md).
6
+
7
+ ---
8
+
9
+ ## Why wildfires?
10
+
11
+ Most RL environments for language models are puzzles, games, or code tasks. We wanted something with three properties at once:
12
+
13
+ 1. **Long-horizon, sparse-terminal reward.** A real plan has to survive 100+ steps before the result lands.
14
+ 2. **Partial observability that *gets worse* during the episode.** Smoke spreads, recon expires, fog-of-war hides what hasn't been scouted recently.
15
+ 3. **An explicit instruction-following channel.** A first-step "operational briefing" the agent must read, internalize, and adhere to — and a reward term that rewards adherence.
16
+
17
+ Wildfire incident command hits all three. An incident commander gets a briefing, has hard resource limits (crews, air tankers, firebreak budget, recon), and has to balance speed vs. coverage vs. civilian safety while wind, slope, and humidity all change underneath them. We turned that into a structured grid environment with typed actions, an `OperationalBriefing` on reset, and a decomposed reward — and then trained an LLM to play the role of the IC.
18
+
19
+ ---
20
+
21
+ ## The environment, top down
22
+
23
+ The environment is OpenEnv-compliant: `reset(task_id, seed) → Observation`, `step(Action) → StepResult`, `state() → dict`. Three difficulty tiers, all runnable on the same code path:
24
+
25
+ ```
+ Easy   → 15×15 flat grid, 1 ignition, constant wind, 80 steps
+ Medium → 25×25 canyon terrain, 2 ignitions, wind shifts, smoke, 150 steps
+ Hard   → 40×40 wildland-urban interface, staggered ignitions,
+          fog-of-war, mid-episode crew casualty, 300 steps
+ ```
31
+
32
+ The agent never directly applies suppression. It positions resources — crews, tankers, firebreaks, recon flights — and the environment computes the resulting fire dynamics each tick. The 11-step tick pipeline is fully deterministic given a seed:
33
+
34
+ ```
+ validate(action) → execute(action) → spread_fire → apply_suppression
+   → evolve_weather → update_moisture → propagate_smoke → tick_cooldowns
+   → expire_recon → trigger_scripted_events → compute_reward → check_termination
+ ```
39
+
40
+ **Fire spreads via a Rothermel-inspired cellular automaton.** Every burning cell rolls against each of its 8 neighbors:
41
+
42
+ ```
+ P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
+           × (1 − moisture) × (1 − suppression) × tier_scale
+ ```
46
+
47
+ Wind alignment dominates spread direction. Slope speeds uphill spread. Suppression from crew presence and tanker drops is *spatial* — it only affects the cells you've actually committed resources to.
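
As a rough illustration of the formula above, the per-neighbor ignition roll might look like the sketch below. It is not the project's `env/fire_spread.py`: the function names, the clamp to `[0, 1]`, and the example factor values are all assumptions made for illustration.

```python
import random


def ignition_probability(base_rate, fuel_factor, wind_factor, slope_factor,
                         moisture, suppression, tier_scale):
    """Product form of P(ignite); the clamp to [0, 1] is an assumption."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * tier_scale)
    return max(0.0, min(1.0, p))


def moore_neighbors(row, col, rows, cols):
    """Yield the in-bounds cells among the 8 neighbors of (row, col)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                yield r, c


def roll_ignitions(row, col, rows, cols, p, rng):
    """Roll once per neighbor with a seeded RNG so results are reproducible."""
    return [cell for cell in moore_neighbors(row, col, rows, cols)
            if rng.random() < p]


# Hypothetical factor values, chosen only to exercise the functions.
rng = random.Random(42)
p = ignition_probability(0.2, 1.0, 1.5, 1.0, moisture=0.5,
                         suppression=0.0, tier_scale=1.0)
newly_ignited = roll_ignitions(7, 7, 15, 15, p, rng)
```

Because suppression enters as `(1 − suppression)`, a fully suppressed cell has zero ignition probability regardless of wind or slope, which is what makes committing resources to specific cells matter.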
48
+
49
+ ---
50
+
51
+ ## Speaking the agent's language
52
+
53
+ A 7B chat model can't natively read a 40×40 grid of cell objects. So we built two adapters between the env and the LLM:
54
+
55
+ **`serialize_observation()`** — Turns the raw `Observation` into a structured prompt:
56
+ - BFS-clusters fire cells into bounding boxes ("3 BURNING clusters near rows 7–12, cols 3–8") so prompt length is `O(regions)` not `O(cells)`.
57
+ - Lists resource state with cooldown warnings.
58
+ - Surfaces the last 5 notable events.
59
+ - Notes weather noise levels explicitly so the model knows readings are not exact.
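
The clustering step in the first bullet can be sketched as a breadth-first search over burning cells. This is a minimal sketch under stated assumptions: the real `serialize_observation()` is not shown in this post, so the function name, the set-of-tuples input, and the choice of 8-connectivity are hypothetical.

```python
from collections import deque


def cluster_bounding_boxes(burning):
    """Group 8-connected burning cells; return one bounding box per cluster.

    `burning` is a set of (row, col) tuples. Each box is
    (min_row, max_row, min_col, max_col).
    """
    unseen = set(burning)
    boxes = []
    while unseen:
        start = unseen.pop()
        queue = deque([start])
        r0 = r1 = start[0]
        c0 = c1 = start[1]
        while queue:
            r, c = queue.popleft()
            # Grow the box to include the cell just visited.
            r0, r1 = min(r0, r), max(r1, r)
            c0, c1 = min(c0, c), max(c1, c)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nxt = (r + dr, c + dc)
                    if nxt in unseen:
                        unseen.remove(nxt)
                        queue.append(nxt)
        boxes.append((r0, r1, c0, c1))
    return sorted(boxes)


boxes = cluster_bounding_boxes({(7, 3), (8, 4), (12, 8), (0, 0)})
```

Each box becomes one line of prompt text, so the prompt grows with the number of fire regions rather than the number of burning cells.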
60
+
61
+ **`parse_action()`** — A 3-layer LLM-output → `Action` mapper:
62
+ 1. Strip code fences, find a JSON object, parse it directly.
63
+ 2. If JSON parsing fails: regex-extract `action_type` and per-action fields.
64
+ 3. Final fallback: return a safe `IDLE`. The training loop never breaks on bad model output.
65
+
66
+ That parser fallback is also a **defense against reward hacking** — there's no clever output that crashes the env or skips a step. Worst case the model burns a step on `IDLE` and pays the small step penalty.
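
The three layers can be sketched as below. This is an illustrative stand-in, not the project's `parse_action()`: it returns plain dicts instead of Pydantic `Action` objects, ignores the observation, and only extracts `target_row` and `target_col` in the regex layer. The action names come from the README's action table; everything else is an assumption.

```python
import json
import re

KNOWN_ACTIONS = {"DEPLOY_CREW", "MOVE_CREW", "DROP_RETARDANT",
                 "BUILD_FIREBREAK", "RECON_FLIGHT", "IDLE"}


def parse_action_sketch(text):
    """Return (action_dict, layer), where layer names the layer that succeeded."""
    # Layer 1: strip code fences, then parse the outermost {...} span as JSON.
    cleaned = re.sub(r"`{3}(?:json)?", "", text)
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, dict) and data.get("action_type") in KNOWN_ACTIONS:
                return data, "json_success"
        except json.JSONDecodeError:
            pass
    # Layer 2: regex-extract the action type and any integer target fields.
    m = re.search(r"action_type\W+([A-Z_]+)", cleaned)
    if m and m.group(1) in KNOWN_ACTIONS:
        action = {"action_type": m.group(1)}
        for key, value in re.findall(r"(target_row|target_col)\W+(\d+)", cleaned):
            action[key] = int(value)
        return action, "regex_fallback"
    # Layer 3: nothing usable was found, so fall back to a safe IDLE.
    return {"action_type": "IDLE"}, "safe_idle"
```

Since every path ends in a `return`, no model output can raise out of the parser, which is the property the reward-hacking argument relies on.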
67
+
68
+ ---
69
+
70
+ ## The reward, designed for GRPO
71
+
72
+ GRPO computes advantages by comparing rollout rewards within a group of completions for the same prompt. If your reward signal is too narrow (e.g. all rewards in `[0, 1]`), the advantages collapse and the gradient washes out. We deliberately built a wide-range, decomposed reward.
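
To see why within-group spread matters, here is a minimal sketch of a group-relative advantage, assuming the common mean-and-standard-deviation normalization. The function name and the `eps` value are illustrative, and TRL's own implementation may differ in details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion scored the same: no completion is preferred over another.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])  # -> [0.0, 0.0, 0.0, 0.0]

# Rewards that differ within the group yield signed advantages.
wide = group_advantages([-3.0, 5.0])
```

When all completions for a prompt land on the same reward, every advantage is zero and that prompt contributes no gradient, so a reward that separates good rollouts from bad ones within a group is what keeps training moving.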
73
+
74
+ **Per-step (dense):**
75
+ ```
+ step_reward = 0.4·Δcontainment + 0.4·Δpop_safety − 0.1·redundant_action
+ ```
78
+
79
+ **Terminal (sparse, on episode end):**
80
+ ```
+ +5.0    if zero population lost
+ +0–2.0  efficiency bonus (faster containment ⇒ more)
+ +1.0    briefing-adherence bonus (all priority zones survived)
+ −3.0 · (pop_lost / total_pop)   if any population lost
+ −2.0    if any crew casualty
+ −0.01 × invalid_action_count    capped at −0.2
+ ```
88
+
89
+ Empirical range: **−8 to +8**. That's a 16-point span, enough for clear advantages between rollout groups.
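
The two reward pieces above can be written out as a short sketch. The function names and argument names are hypothetical, and treating the efficiency bonus as `2.0 × efficiency` with `efficiency` in `[0, 1]` is an assumption; the post only states that the bonus ranges from 0 to 2.0.

```python
def step_reward(d_containment, d_pop_safety, redundant):
    """Dense per-step term: weighted deltas minus a redundancy penalty."""
    return 0.4 * d_containment + 0.4 * d_pop_safety - (0.1 if redundant else 0.0)


def terminal_reward(pop_lost, total_pop, efficiency, priority_zones_survived,
                    crew_casualty, invalid_actions):
    """Sparse terminal term, added once when the episode ends."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0
    else:
        reward -= 3.0 * (pop_lost / total_pop)
    # Assumption: efficiency is a score in [0, 1] scaled to the 0-2.0 bonus.
    reward += 2.0 * max(0.0, min(1.0, efficiency))
    if priority_zones_survived:
        reward += 1.0
    if crew_casualty:
        reward -= 2.0
    # Invalid-action penalty is capped at 0.2 in total.
    reward -= min(0.01 * invalid_actions, 0.2)
    return reward


best_case = terminal_reward(0, 100, 1.0, True, False, 0)  # -> 8.0
```

The best terminal outcome under these assumptions is 5.0 + 2.0 + 1.0 = 8.0, which lines up with the upper end of the empirical range quoted above.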
90
+
91
+ **Two reward functions, not one.** For GRPO we register two reward functions with TRL:
92
+ - `reward_fn_outcome` — the full episodic reward described above (computed by *running the full episode*, see "What broke" below).
93
+ - `reward_fn_format` — a tiny standalone JSON-format check (`+0.15` for valid JSON with a recognized `action_type`, `0.0` for valid JSON with an unknown type, `−0.20` for unparseable garbage). This rewards good formatting independently from policy quality.
94
+
95
+ This is the "multiple independent reward functions" pattern from the OpenEnv hackathon guide — and it cost us about 30 lines of code.
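
A format-only reward in this style can be sketched in a few lines. This is a hedged sketch, not the project's `reward_fn_format`: the helper name `format_score` is hypothetical, completions are assumed to arrive as plain strings, and the only TRL convention relied on is that a custom reward function receives `completions` plus keyword arguments and returns one float per completion.

```python
import json
import re

KNOWN_ACTIONS = {"DEPLOY_CREW", "MOVE_CREW", "DROP_RETARDANT",
                 "BUILD_FIREBREAK", "RECON_FLIGHT", "IDLE"}


def format_score(text):
    """Score formatting only; no environment or observation is needed."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return -0.20  # no JSON object at all
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return -0.20  # looks like JSON but does not parse
    if isinstance(data, dict) and data.get("action_type") in KNOWN_ACTIONS:
        return 0.15   # valid JSON with a recognized action_type
    return 0.0        # valid JSON, unknown or missing action_type


def reward_fn_format_sketch(completions, **kwargs):
    """One float per completion, independent of policy quality."""
    return [format_score(c) for c in completions]


scores = reward_fn_format_sketch([
    '{"action_type": "IDLE"}',
    '{"action_type": "FLY"}',
    "no json here",
])  # -> [0.15, 0.0, -0.2]
```

Keeping this function free of any environment state is what lets it run on every completion cheaply and be logged as its own metric.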
96
+
97
+ ---
98
+
99
+ ## Training, in two stages
100
+
101
+ ### Stage 1 — SFT warm-up (~30 min)
102
+
103
+ We harvested 4,300 `(prompt, action_json)` pairs from `HeuristicAgent` rollouts on successful episodes (filtered to `population_lost == 0`):
104
+
105
+ | Tier | Examples |
106
+ |---|---|
107
+ | Easy | 2,000 |
108
+ | Medium | 1,500 |
109
+ | Hard | 800 |
110
+
111
+ Then 1 epoch of SFT on Qwen-2.5-7B-Instruct via Unsloth 4-bit + LoRA (`r=32`, `α=64`, target modules: `q,k,v,o,gate,up,down`). The aim is **format priming**, not policy quality — we just want the model to reliably emit valid JSON `Action` objects so GRPO has something to optimize against. Going straight from base model to GRPO produced near-zero reward in our early experiments because most completions parsed as `IDLE`.
112
+
113
+ ### Stage 2 — GRPO with curriculum (~75 min)
114
+
115
+ Starting from the SFT adapter, we run TRL's `GRPOTrainer` with 8 generations per prompt, `learning_rate=3e-6`, `max_completion_length=192`. The reward function is the key piece:
116
+
117
+ ```python
+ def reward_fn_outcome(completions, prompts, tier=None, seed=None, **kwargs):
+     rewards = []
+     for i, completion in enumerate(completions):
+         env = WildfireEnv()
+         # CRUCIAL: replay the EXACT (tier, seed) that produced this prompt
+         obs = env.reset(task_id=tier[i], seed=seed[i])
+         action, _ = parse_action(completion, obs)
+         result = env.step(action)
+         total = result.reward
+         # Heuristic carries the episode to completion so terminal reward fires
+         heuristic = HeuristicAgent()
+         while not env.done:
+             result = env.step(heuristic.act(env._current_obs))
+             total += result.reward
+         rewards.append(total)
+     return rewards
+ ```
135
+
136
+ The `CurriculumController` watches a rolling 10-batch reward average and promotes the dataset from easy → medium → hard. A `TrainerCallback` rebuilds the prompt dataset whenever a promotion fires, so prompts and reward states stay synchronized.
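
The promotion rule can be sketched as a small class. This is a minimal sketch, not the project's `CurriculumController`: the class name, the `record` method, the promotion thresholds, and the decision to clear the window after a promotion are all assumptions, since the post only specifies a rolling 10-batch average and the easy → medium → hard order.

```python
from collections import deque


class CurriculumSketch:
    """Promote easy -> medium -> hard when a rolling mean clears a threshold."""

    TIERS = ("easy", "medium", "hard")

    def __init__(self, thresholds=(6.0, 5.0), window=10):
        # thresholds[i] is the rolling mean needed to leave TIERS[i];
        # the defaults are placeholders, not tuned values.
        self.thresholds = thresholds
        self.recent = deque(maxlen=window)
        self.index = 0

    def get_tier(self):
        return self.TIERS[self.index]

    def record(self, batch_mean_reward):
        """Add one batch's mean reward; return True if a promotion fired."""
        self.recent.append(batch_mean_reward)
        if self.index >= len(self.thresholds):
            return False  # already on the last tier
        window_full = len(self.recent) == self.recent.maxlen
        rolling_mean = sum(self.recent) / len(self.recent)
        if window_full and rolling_mean >= self.thresholds[self.index]:
            self.index += 1
            self.recent.clear()  # start the next tier with a fresh window
            return True
        return False
```

A training callback can then compare `get_tier()` with the tier it last saw and rebuild the prompt dataset when `record` returns `True`.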
137
+
138
+ ---
139
+
140
+ ## What broke (and what we fixed)
141
+
142
+ We're including this section because we think the bugs are more interesting than the headline numbers.
143
+
144
+ ### v1 GRPO bug #1 — Frozen dataset, live curriculum
145
+
146
+ Our first GRPO run promoted the controller to medium at step 10 and to hard at step 20, but the prompt dataset was built once before `trainer.train()` and *never refreshed*. So from step 10 onward the controller reported a harder tier (medium, then hard from step 20) while the model was still being scored on easy-tier prompts. Training stats looked fine; the model wasn't actually learning the harder tasks.
147
+
148
+ **Fix:** add a `TrainerCallback.on_step_end` that compares `controller.get_tier()` against the last seen tier and rebuilds the train dataset from scratch when they diverge.
149
+
150
+ ### v1 GRPO bug #2 — Truncated rollouts never saw terminal reward
151
+
152
+ The first reward function ran for a fixed 15 steps, applying the LLM action at step 0 and the heuristic for 14 more steps. But hard tier has `min_active_steps=80`, so the +5.0 terminal reward never fired during training. GRPO advantages were dominated by ±0.5 per-step deltas, not the ±5 terminal spikes the reward was *designed* around.
153
+
154
+ **Fix:** in v2, the reward function runs the full episode to `env.done`. This makes training 2× slower but *the gradient signal is now comparable to baseline reward*.
155
+
156
+ ### v1 GRPO bug #3 — Prompt/reward state mismatch
157
+
158
+ The most insidious bug. The dataset's prompts were generated from `(tier, seed=fresh_random)`. The reward function then **picked a different random seed** to roll out against. So the model was being scored in a completely different env state than the one shown in its prompt. Imagine being asked "what would you do here?" while shown a photo of New York, and graded on what would have happened in Tokyo.
159
+
160
+ **Fix:** every dataset row stores its `seed`. The reward function reads `seed` from `kwargs` (TRL passes dataset columns through as kwargs) and resets the env to that exact `(tier, seed)`. Prompt state and reward state are now identical.
161
+
162
+ ### v1 GRPO bug #4 — Wasted inner generations
163
+
164
+ The v1 reward function called `model.generate()` *seven extra times per completion* to build a multi-step rollout. But GRPO gradients only flow through the originally sampled completion — those 7 extra generations were expensive noise.
165
+
166
+ **Fix:** `MODEL_STEPS = 1`. The model's sampled completion is applied as the step-0 action; the heuristic carries the rest. The wall-clock per training step dropped by ~70%.
167
+
168
+ ### v1 GRPO bug #5 — Crash on format-only reward
169
+
170
+ We tried to add a format-validity reward early on, but `parse_action(text, obs)` reads `obs.grid` to validate spatial fields. Calling it with `obs=None` for a pure format check crashed.
171
+
172
+ **Fix:** a standalone `check_json_format(text)` function that doesn't need an obs. Three-state output (`json_success / regex_fallback / safe_idle`) → reward `(+0.15 / 0 / −0.20)`.
173
+
174
+ We're being open about these bugs because we think *the post-mortem matters more than the leaderboard.* Anyone training GRPO on a custom OpenEnv environment is likely to hit at least three of these five.
175
+
176
+ ---
177
+
178
+ ## Results
179
+
180
+ > Final numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15 --tiers easy medium hard` on held-out seeds 200–214.
181
+
182
+ | Agent | Easy | Medium | Hard |
183
+ |---|---|---|---|
184
+ | Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
185
+ | Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
186
+ | **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
187
+
188
+ **JSON success rate (trained agent):** Easy {TBD}% · Medium {TBD}% · Hard {TBD}% — the SFT warm-up's job.
189
+
190
+ **Population-saved %:** Easy {TBD}% · Medium {TBD}% · Hard {TBD}% — the headline safety metric.
191
+
192
+ The training reward curve and tier-promotion timeline are in [`training/training_dashboard.png`](training/training_dashboard.png); the full W&B run is at *(link added post-run)*.
193
+
194
+ ### What the trained agent learned (qualitatively)
195
+
196
+ *Filled in post-run from inspection of `training/samples/call_*.txt`:*
197
+
198
+ - {TBD: behavioral pattern 1, e.g. "tends to drop retardant ahead of wind direction rather than reactively"}
199
+ - {TBD: behavioral pattern 2, e.g. "deploys crews around priority zones first, even when fire is closer to lower-priority cells"}
200
+ - {TBD: behavioral pattern 3, e.g. "saves recon for mid-episode after staggered ignition fires"}
201
+
202
+ ---
203
+
204
+ ## Key learnings
205
+
206
+ 1. **Reward decomposition matters more than model size.** A wide, structured reward gave a 7B model enough signal to surpass random and approach the heuristic on medium. We expect a 1.5B model would also work — the bottleneck is reward design, not parameters.
207
+ 2. **Curriculum is essential for long-horizon tasks.** Throwing hard tier directly at the SFT model produced near-zero gradient signal — the +5 terminal bonus was almost never observed. Easy → medium → hard with auto-promotion was the difference.
208
+ 3. **Format compliance must be a first-class reward, not an afterthought.** The format-only reward function (`+0.15 / 0 / −0.20`) cost us 30 lines and meaningfully reduced parse-failure rate during training. It also makes the JSON success rate trackable as an independent metric.
209
+ 4. **Replay the prompt's exact env state when scoring completions.** Stochastic env resets in your reward function turn GRPO into "what's a good action *somewhere*?" instead of "what's a good action *here*?". The latter is what you actually want.
210
+ 5. **Heuristic continuation is a powerful variance-reduction trick.** Letting the heuristic finish each rollout reduces noise from the model's later (uncertain) actions, so the gradient signal mostly reflects the *first* action's quality. Combined with full-episode rollout, you get terminal reward without 300 `model.generate()` calls per training step.
211
+ 6. **Inspect generations on disk every N steps.** TRL's stdout logging shows you `mean_reward` only. Saving the first completion of each batch to `training/samples/call_{n}.txt` is what catches reward hacking and format regressions before they become catastrophic.
212
+
213
+ ---
214
+
215
+ ## Limitations and future work
216
+
217
+ - **Heuristic continuation is a double-edged sword.** It reduces variance, but the reward attributes a good outcome to the model's first action even when the heuristic deserves most of the credit. A planned ablation: train one model with heuristic continuation and one with full-model rollout, compare on held-out seeds.
218
+ - **Hard tier still has high variance.** Heuristic std on hard is ±3.79 — bimodal between full saves and total losses. Smoothing the ignition-spawn distribution (`_find_ignition_candidate` in `wildfire_env.py`) would reduce this.
219
+ - **Single-tenant FastAPI server.** The HF Space currently uses a module-level `_env` singleton. Two concurrent users would clobber each other's episode. Per-session env binding via cookie/header is a 30-line fix we deferred.
220
+ - **Held-out generalization untested at scale.** We evaluate on seeds 200–214 (15 per tier), which don't appear in the 0–99 training pool. A larger holdout (say 200–999) would tighten the confidence intervals.
221
+ - **No multi-agent coordination experiments yet.** Each crew already runs a local autonomous policy; an obvious next step is to also let multiple LLM ICs collaborate on a shared incident.
222
+
223
+ ---
224
+
225
+ ## Acknowledgments
226
+
227
+ - **Meta** and **Hugging Face** for the OpenEnv hackathon, the OpenEnv spec, and Hugging Face Spaces.
228
+ - **Scaler** for being an amazing host. We had great fun interacting with participants from many parts of the country and from all walks of life.
229
+ - **Unsloth** for fast 4-bit LoRA training on consumer/Colab GPUs.
230
+ - **The TRL team** for `GRPOTrainer`, especially the multi-reward-function support.
231
+ - The Rothermel surface-fire spread model, which has shaped wildfire science since 1972 — even our toy version owes its structure to that work.
232
+
233
+ ---
234
+
235
+ ## Links
236
+
237
+ - 🚀 Live env on Hugging Face: [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
238
+ - 💻 Source on GitHub: [`Abrodolph/Wildfire-Containment-Simulator`](https://github.com/Abrodolph/Wildfire-Containment-Simulator)
239
+ - 📒 GRPO notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb)
240
+ - 📒 SFT notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb)
241
+ - 📊 Baselines: [`scripts/results.json`](scripts/results.json)
242
+ - 📈 Training dashboard: [`training/training_dashboard.png`](training/training_dashboard.png)
243
+ - 🎬 Heuristic replay: [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif)
244
+ - 📄 Top-level overview: [`README.md`](README.md)
245
+
246
+ *— Team Wildfire, April 2026*
CLAUDE.md DELETED
@@ -1,69 +0,0 @@
1
- # CLAUDE.md
2
-
3
- This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
-
5
- ## Commands
6
-
7
- ```bash
- # Install dependencies
- pip install -r requirements.txt
- pip install -e ".[dev]" # editable mode with test deps
-
- # Run tests
- pytest # all tests
- pytest tests/test_graders.py # single test file
- pytest -k "test_reward" # tests matching a pattern
-
- # Run baseline evaluation (both agents, all 3 tiers, default 5 runs)
- python scripts/evaluate.py [num_runs]
-
- # Compare evaluation results against saved baselines
- python scripts/eval_compare.py
-
- # Start the REST API server on port 7860
- python server/app.py
- serve # via pyproject.toml entry point
-
- # Docker
- docker build -t wildfire-sim .
- docker run -p 7860:7860 wildfire-sim
- ```
31
-
32
- Validate environment changes by running `scripts/evaluate.py` and comparing scores against `scripts/results.json` baselines. The `HeuristicAgent` score is the primary reference for difficulty scaling.
33
-
34
- ## Architecture
35
-
36
- The simulator is an OpenEnv-compliant RL environment where AI agents dispatch firefighting resources on a grid to protect populated zones from wildfire.
37
-
38
- **Core environment** (`env/`): Components orchestrated by `wildfire_env.py`:
39
- - `wildfire_env.py` — Main entry point implementing OpenEnv API (`reset`, `step`, `state`). Manages the 11-step tick sequence, action validation (invalid actions return penalty reward, never crash), and event logging.
40
- - `models.py` — All Pydantic schemas: `Action`, `Observation`, `StepResult`, `TierConfig`. The three `TierConfig` instances (easy/medium/hard) define grid size, resource counts, episode length, and reward weights.
41
- - `grid.py` — Terrain generation (elevation, fuel types, water, populated zones), cell state management, smoke propagation, fog-of-war.
42
- - `fire_spread.py` — Rothermel-inspired cellular automaton. Each burning cell ignites 8 Moore-neighborhood cells based on: `P(ignite) = base_rate × fuel × wind × slope × (1 − moisture) × (1 − suppression) × tier_scale`. Tier scale: easy=1.0, medium=0.7, hard=0.55.
43
- - `weather.py` — Stochastic wind (random walk + shift events), sinusoidal humidity cycle, Poisson rain events.
44
- - `resources.py` — Crew deployment/movement (adjacent cells only), tanker drops (5-step cooldown), firebreak construction, recon budget tracking.
45
- - `reward.py` — Weighted composite of 5 components: containment, population safety, efficiency, speed, area saved. Also computes per-step delta rewards and a terminal reward on episode end.
46
- - `briefing.py` — Generates a structured `OperationalBriefing` on `reset()`, attached to the first `Observation`. Provides incident cause, priority zones, infrastructure labels, and wind forecast for LLM context.
47
- - `serialization.py` — Converts an `Observation` into a structured text prompt for LLM agents via `serialize_observation(obs, step_num, max_steps)`.
48
- - `action_parser.py` — 3-layer LLM output → `Action` parser: direct JSON → regex field extraction → safe IDLE fallback.
49
- - `curriculum.py` — `CurriculumController` for auto-promoting agents across tiers based on a rolling 10-episode average reward.
50
- - `rendering.py` — Renders ground-truth state dicts into RGB frames for episode replay GIFs.
51
-
52
- **Agents** (`agents/`): `RandomAgent` (lower-bound baseline) and `HeuristicAgent` (priority-based: evacuate endangered crews → protect population → air support → contain perimeter → recon → idle). New agents implement `act(obs: Observation) -> Action`.
53
-
54
- **Graders** (`graders/`): `grade(agent, seed=42) -> float` for each tier. Called by `scripts/evaluate.py` to benchmark.
55
-
56
- **Server** (`server/app.py`): FastAPI wrapping a singleton `WildfireEnv`. Endpoints: `POST /reset?task_id=easy&seed=42`, `POST /step` (Action JSON body), `GET /state`, `GET /health`.
57
-
58
- **LLM inference** (`inference.py`): Runs an OpenAI-compatible client against the three tasks. Requires env vars `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`.
59
-
60
- **Scripts** (`scripts/`): `evaluate.py` (benchmark), `eval_compare.py` (diff vs baselines), `replay.py` (GIF generation), `plot_dashboard.py` (metrics visualization), `find_demo_seed.py` (search for visually interesting seeds), `run_demo.py`.
61
-
62
- ## Key Conventions
63
-
64
- - All external data uses Pydantic models — never bypass validation at the `env/` boundary.
65
- - Invalid actions return a penalty reward and continue the episode; they never raise exceptions.
66
- - All env components use the 8-cell Moore neighborhood consistently.
67
- - `reset(task_id, seed)` must be fully deterministic — use `np.random.default_rng(seed)` and pass the RNG down to all components.
68
- - Agents must not access `state()` (ground truth) during normal execution — only the `Observation` returned by `reset`/`step`.
69
- - Hard tier enables staggered ignition (a third fire spawns mid-episode) and crew loss events; both are configured via `TierConfig` fields.
README.md CHANGED
@@ -12,70 +12,105 @@ tags:
12
  - openenv
13
  - wildfire
14
  - rl-environment
15
  ---
16
 
17
  # Wildfire Containment Simulator
18
 
19
- **OpenEnv Finale Submission — Theme 2: Long-Horizon Planning & Instruction Following**
20
 
21
  ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
22
  ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
23
  ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
24
 
25
- A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, tracking state across 300-step episodes, and recovering from cascading failures. Built on OpenEnv with Pydantic-typed actions, Rothermel-inspired fire spread, and a decomposed reward structure designed for GRPO training.
26
 
27
- **Headline result:** Our trained Qwen-2.5-1.5B IC achieves {TBD}% population survival on Hard tier vs. {TBD}% for the rule-based heuristic baseline. *(Numbers will be updated post-training on April 24.)*
 
28
 
29
- ## Quick Links
30
 
31
- - 📺 **YouTube Pitch Video:** [Watch the 2-minute demo](https://www.youtube.com/watch?v=YOUTUBE_VIDEO_ID_HERE)
32
- - 🔥 **HF Space (live env):** [Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
33
- - 📒 **Training notebook (Colab):** [training/grpo_colab.ipynb](training/grpo_colab.ipynb)
34
- - 📊 **Eval results:** [scripts/results.json](scripts/results.json)
35
- - 🎬 **Demo:** `python scripts/run_demo.py`
36
- - 📝 **Blog post:** [Read below](#-blog-post-teaching-a-15b-language-model-to-fight-wildfires-with-grpo)
37
 
38
  ---
39
 
40
  ## Why Theme 2
41
 
42
- - **Long-horizon planning (up to 300 steps, sparse terminal reward):** The agent receives dense per-step feedback on containment deltas but only earns the large +5.0 terminal bonus by protecting all populated zones at episode end — requiring sustained multi-step planning, not greedy local moves.
43
- - **Instruction following (operational briefings):** Every episode opens with a structured `OperationalBriefing` naming priority zones, infrastructure to preserve, and forecasted weather events. The agent earns a +1.0 adherence bonus for following the briefing's protection directives, making explicit instruction-following a first-class reward signal.
44
- - **Recovery from early mistakes (staggered ignitions, crew loss events):** Hard tier injects a second ignition at a scripted step and forces one crew casualty mid-episode. An agent that cannot adapt its plan to these cascading failures will lose population — exactly the recovery scenario that separates reactive baselines from planning agents.
45
 
46
  ---
47
 
48
  ## Real-World Motivation
49
 
50
- Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to request air support, how to protect communities, and how to adapt when conditions change mid-operation.
51
 
52
- This project turns that into a structured AI task with typed actions, partial observability, changing weather, multiple resource constraints, and explicit tradeoffs between speed, efficiency, containment, and civilian safety.
53
 
54
  ---
55
 
56
- ## Reproducing Our Results
57
 
58
  ```bash
59
- # Install
60
  uv pip install -r requirements.txt
61
  uv pip install -e .
62
 
63
- # Run baseline eval (both agents, all 3 tiers, 5 runs)
64
  python scripts/evaluate.py 5
65
 
66
- # Run eval comparison table
67
- python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic
68
 
69
- # Run the pitch demo (generates demos/heuristic_demo.gif)
70
- python scripts/run_demo.py
71
 
72
- # Render any episode as a GIF
73
- python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/replay.gif
74
 
75
- # Open GRPO training notebook in Colab
76
- # See training/README.md for instructions
77
  ```
78
 
 
 
79
  ---
80
 
81
  ## Environment API
@@ -84,7 +119,7 @@ python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demo
84
  from env import WildfireEnv, Action, ActionType, Direction
85
 
86
  env = WildfireEnv()
87
- obs = env.reset(task_id="easy", seed=42) # Returns Observation (with OperationalBriefing)
88
 
89
  while not env.done:
90
  action = Action(
@@ -92,126 +127,152 @@ while not env.done:
92
  crew_id="crew_0",
93
  target_row=7, target_col=7,
94
  )
95
- result = env.step(action) # Returns StepResult
96
  obs = result.observation
97
- reward = result.reward # decomposed float, range ~-8 to +8
98
  done = result.done
99
 
100
- state = env.state() # Full ground truth (for grading)
101
  ```
102
 
 
 
103
  ---
104
 
105
  ## Action Space
106
 
107
- All actions are Pydantic-validated. Invalid actions return a penalty reward without crashing.
108
 
109
- | Action | Parameters | Description |
110
- |--------|-----------|-------------|
111
- | `DEPLOY_CREW` | crew_id, target_row, target_col | Place an undeployed crew on a safe cell |
112
- | `MOVE_CREW` | crew_id, direction (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
113
- | `DROP_RETARDANT` | tanker_id, target_row, target_col | Drop retardant on a 3x3 area with cooldown |
114
- | `BUILD_FIREBREAK` | crew_id, direction | Build a permanent non-flammable cell adjacent to a crew |
115
- | `RECON_FLIGHT` | target_row, target_col | Reveal a 10x10 area for 5 steps |
116
- | `IDLE` | reason (optional) | Agent explicitly waits |
 
 
 
117
 
118
  ---
119
 
120
  ## Observation Space
121
 
122
- | Component | Contents | Noise |
123
- |-----------|----------|-------|
124
- | `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, forecasts | First step only |
125
- | `grid` | 2D array of cell states (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion; fog-of-war on hard tier |
126
- | `weather` | wind_speed, wind_direction, humidity, rain_active | +/-5 km/h, +/-20 deg on medium/hard |
127
  | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
128
- | `stats` | cells_burned, cells_burning, population_lost, containment_pct, current_step | Fully observable |
129
  | `recent_events` | Last 5 notable events | Fully observable |
130
 
 
 
131
  ---
132
 
133
  ## Reward Function
134
 
135
- Decomposed structure designed for GRPO training — wide reward range (-8 to +8) produces meaningful advantages:
136
 
137
  **Per-step (dense):**
138
- ```text
139
- step_reward = delta_containment * 0.4 + delta_pop_safety * 0.4 - 0.1 (if redundant action)
140
  ```
141
 
142
- **Terminal (sparse, added on episode end):**
143
- ```text
144
  +5.0 if all populations safe
145
- +0–2.0 efficiency bonus (faster = more)
146
- +1.0 briefing adherence bonus (all priority zones survived)
147
- -3.0 * (pop_lost / total_pop) if any population lost
148
- -2.0 if any crew casualty occurred
 
149
  ```
150
 
151
- | Tier | Spread Scale | Max Episode Reward |
152
- |------|-------------|-------------------|
153
- | Easy | 1.0× | ~8+ |
154
- | Medium | 0.7× | ~7+ |
155
- | Hard | 0.55× | ~6+ |
 
 
156
 
157
  ---
158
 
159
  ## Three Difficulty Tiers
160
 
161
  ### Task 1 — Easy: Flatland Grass Fire
162
-
163
- - 15×15 flat grid, single ignition, constant wind
164
- - No smoke occlusion or fog-of-war
165
- - 4 crews, 1 tanker, 15 firebreak cells, 80 steps
166
- - Focus: basic deployment and perimeter control
167
 
168
  ### Task 2 — Medium: Canyon Terrain with Wind Shifts
169
-
170
- - 25×25 mixed terrain with elevation and two ignition points
171
- - Variable wind, smoke occlusion, sensor noise, and rain events
172
- - 5 crews, 2 tankers, 20 firebreak cells, 150 steps
173
- - Focus: terrain-aware containment and multi-front triage
174
 
175
  ### Task 3 — Hard: Wildland-Urban Interface Crisis
176
-
177
- - 40×40 terrain with roads, rivers, urban zones, and staggered ignitions
178
- - Fog-of-war, aggressive wind shifts, limited recon, and crew loss
179
- - 6 crews, 3 tankers, 30 firebreak cells, 300 steps
180
- - Focus: long-horizon planning under uncertainty
181
 
182
  ---
183
 
184
  ## Fire Spread Model
185
 
186
- A **Rothermel-inspired cellular automaton** using the 8-cell Moore neighborhood:
187
 
188
- ```text
189
- P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor × (1 - moisture) × (1 - suppression) × tier_scale
 
190
  ```
191
 
192
- | Factor | Description |
193
- |--------|-------------|
194
- | `base_rate` | Baseline spread rate by fuel type |
195
  | `fuel_factor` | Fuel load of the target cell |
196
- | `wind_factor` | Boost/dampen based on wind alignment with spread direction |
197
- | `slope_factor` | Fire spreads faster uphill |
198
- | `moisture` | Wet ground reduces ignition probability |
199
- | `suppression` | Crew and retardant coverage reduces spread |
200
- | `tier_scale` | easy=1.0, medium=0.7, hard=0.55 |
 
 
201
 
202
  ---
203
 
204
- ## Baseline Scores
205
 
206
- *(5 runs, seeds 42–46; updated post-Prompt 10 with decomposed reward)*
207
 
208
- | Agent | Easy | Medium | Hard |
209
- |-------|------|--------|------|
210
- | Random | {TBD} | {TBD} | {TBD} |
211
- | Heuristic | {TBD} | {TBD} | {TBD} |
212
- | Trained LLM (ours) | {TBD} | {TBD} | {TBD} |
 
213
 
214
- *Numbers will be updated post-training on April 24. Run `python scripts/evaluate.py 5` to reproduce baselines.*
215
 
216
  ---
217
 
@@ -220,204 +281,81 @@ P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor × (1 - mois
220
  ```text
221
  Wildfire-Containment-Simulator/
222
  ├── env/
223
- │ ├── wildfire_env.py # Main environment: step(), reset(), state()
224
- │ ├── models.py # Pydantic models (Action, Observation, etc.)
225
- │ ├── grid.py # Grid terrain, smoke, moisture, fog-of-war
226
  │ ├── fire_spread.py # Cellular automaton fire propagation
227
  │ ├── weather.py # Stochastic weather engine
228
- │ ├── resources.py # Crew/tanker/firebreak/recon management
229
  │ ├── reward.py # Decomposed step + terminal reward
230
  │ ├── briefing.py # OperationalBriefing generation
231
  │ ├── serialization.py # Observation → LLM prompt
232
  │ ├── action_parser.py # LLM output → Action (3-layer fallback)
233
- │ ├── rendering.py # Frame rendering for GIF replay
234
- │ └── curriculum.py # Auto-promote/demote curriculum controller
235
  ├── agents/
236
  │ ├── random_agent.py
237
  │ └── heuristic_agent.py
238
  ├── graders/
239
- │ ├── grader_easy.py # Returns (total_reward, details_dict)
240
  │ ├── grader_medium.py
241
  │ └── grader_hard.py
242
  ├── scripts/
243
- │ ├── evaluate.py # Baseline eval + detailed metrics
244
- │ ├── eval_compare.py # Multi-agent comparison table
 
 
245
  │ ├── replay.py # Render episode as GIF
246
- │ ├── run_demo.py # Pitch demo (DEMO_SEED=365)
247
- │ ├── find_demo_seed.py # Scan seeds for best demo candidate
248
- │ └── plot_dashboard.py # 4-panel training curves dashboard
249
  ├── training/
250
- │ ├── grpo_colab.ipynb # GRPO training notebook (Colab, T4)
 
 
 
251
  │ └── README.md
252
  ├── server/
253
- │ └── app.py # FastAPI server (port 7860)
254
- ├── tests/ # pytest test suite
 
255
  ├── demos/ # GIF/PNG demo assets
256
- ├── openenv.yaml # OpenEnv spec metadata
257
- ├── Dockerfile
258
- └── README.md
 
259
  ```
260
 
261
  ---
262
 
263
- ## Multi-Agent Crew Architecture
264
-
265
- Crews are not passive tools — each deployed crew runs a **local policy** every step unless the IC issues an explicit order:
266
-
267
- | Situation | Autonomous behaviour |
268
- |-----------|---------------------|
269
- | Intensity > 0.8 at crew cell | Retreat to safest adjacent cell |
270
- | Fire visible in 3×3 neighbourhood | Advance toward nearest burning cell |
271
- | No fire visible | Hold position |
272
 
273
- **IC actions that suppress local policy:**
274
- - `MOVE_CREW`: explicit movement overrides retreat/advance for that step
275
- - `DEPLOY_CREW`: counts as an IC order; local policy skips the deployment step
276
- - `ORDER_CREW_OBJECTIVE`: sets a persistent objective (`hold`, `advance`, `retreat`, `prioritize_north/south/east/west`) that biases the local policy until changed
277
-
278
- **Autonomous saves** are tracked in `env.resources.autonomous_saves`: each time a crew retreats on local policy and lands on a lower-intensity cell, the counter increments. These become talking points in the demo narrative.
 
279
 
280
  ---
281
 
282
- ## Key Design Decisions
283
-
284
- 1. **Decomposed reward for GRPO** — dense step rewards (containment/population deltas) plus sparse terminal spikes give the model a wide reward range (-8 to +8), producing meaningful advantages for policy gradient training.
285
- 2. **Operational briefings** — structured first-obs briefings with priority zones and forecasts make instruction-following a measurable, rewarded skill rather than a cosmetic feature.
286
- 3. **Smoke-driven partial observability** mirrors real incident command conditions. Fog-of-war on hard tier forces recon investment.
287
- 4. **Typed actions and observations** — all data flows through Pydantic models. Invalid actions return a penalty reward and never crash.
288
- 5. **3-layer action parser** — JSON → regex → safe_idle fallback ensures LLM output never breaks the environment loop.
289
- 6. **Deterministic seeding** — `np.random.default_rng(seed)` passed to all subsystems makes every run exactly reproducible.
290
-
291
- ---
292
-
293
- ## 📝 Blog Post: Teaching a 1.5B Language Model to Fight Wildfires with GRPO
294
-
295
- *We built a partially-observable disaster simulator and trained a tiny LLM to act as Incident Commander — here's what we learned.*
296
-
297
- ### Introduction
298
-
299
- Every year, wildfires burn millions of acres, destroy communities, and kill people. Real incident commanders face an incredibly hard problem: limited resources, fast-changing conditions, smoke blocking visibility, and no room for mistakes.
300
 
301
- We asked: *what if an AI could learn to do this?*
302
 
303
- For the [Meta OpenEnv Hackathon](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator), we built the **Wildfire Containment Simulator** — a grid-based RL environment where an LLM acts as Incident Commander, dispatching fire crews, air tankers, and building firebreaks to protect civilian populations from a spreading wildfire.
304
-
305
- We then trained **Qwen-2.5-1.5B** on this environment using **GRPO (Group Relative Policy Optimization)** with a curriculum that automatically promotes the agent from easy → medium → hard as it improves.
306
-
307
- ### The Problem: Why Is This Hard?
308
-
309
- This isn't a toy. Our simulation captures the key difficulties of real wildfire response:
310
-
311
- | Challenge | How We Model It |
312
- |-----------|----------------|
313
- | **Partial observability** | Smoke occludes cells; Hard tier adds full fog-of-war |
314
- | **Changing conditions** | Stochastic wind (random-walk + shift events), sinusoidal humidity cycles, Poisson rain |
315
- | **Resource constraints** | Limited crews, tankers with cooldowns, finite firebreak budget |
316
- | **Long horizons** | Up to 300 steps on Hard tier with sparse terminal rewards |
317
- | **Recovery from failure** | Hard tier injects a second ignition mid-episode and forces one crew casualty |
318
- | **Instruction following** | Episode opens with a structured `OperationalBriefing` — following it is rewarded |
319
-
320
- The agent must balance five competing objectives simultaneously: containment speed, population safety, resource efficiency, area preservation, and crew safety.
321
-
322
- ### The Environment Architecture
323
-
324
- The simulator follows the OpenEnv API (`reset`, `step`, `state`) and is built entirely on **Pydantic-typed** data models — every action is validated, invalid actions return a penalty reward and never crash the loop.
325
-
326
- #### Three Difficulty Tiers
327
-
328
- ```
329
- Easy → 15×15 flat grid, 1 ignition, constant wind, 80 steps
330
- Medium → 25×25 canyon terrain, 2 ignitions, wind shifts, smoke, 150 steps
331
- Hard → 40×40 wildland-urban interface, staggered ignitions, fog-of-war, 300 steps
332
- ```
333
-
334
- #### Fire Spread: Rothermel-Inspired Cellular Automaton
335
-
336
- Every burning cell attempts to ignite its 8 Moore-neighborhood neighbors each tick:
337
-
338
- ```
339
- P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
340
- × (1 − moisture) × (1 − suppression) × tier_scale
341
- ```
342
-
343
- Wind alignment dramatically changes spread direction. Slope makes fire climb uphill faster. Wet ground from rain events slows spread. Ground crew presence applies local suppression.
344
-
345
- #### Action Space
346
-
347
- The agent controls 6 action types via structured JSON:
348
-
349
- | Action | What It Does |
350
- |--------|-------------|
351
- | `DEPLOY_CREW` | Position a ground crew on the grid |
352
- | `MOVE_CREW` | Move a crew one cell (8 directions) |
353
- | `DROP_RETARDANT` | Air tanker 3×3 suppression drop (5-step cooldown) |
354
- | `BUILD_FIREBREAK` | Permanent non-flammable cell adjacent to crew |
355
- | `RECON_FLIGHT` | Reveal a 10×10 area for 5 steps |
356
- | `IDLE` | Explicit wait with optional reasoning |
357
-
358
- #### Observation to Prompt: The Serializer
359
-
360
- A key design decision was making the observation **LLM-friendly**. Our `serialize_observation()` function converts the raw grid state into a structured text prompt with:
361
- - BFS-clustered fire region descriptions ("3 BURNING clusters near row 7–12, col 3–8")
362
- - Resource status with cooldown warnings
363
- - Recent events log (last 5 notable happenings)
364
- - Weather reading with noise levels noted
365
-
366
- ### The Reward Structure: Designed for GRPO
367
-
368
- GRPO needs a wide reward range to compute meaningful advantages. We decomposed the reward into:
369
-
370
- **Dense (per-step):**
371
- ```
372
- step_reward = delta_containment × 0.4 + delta_pop_safety × 0.4 − 0.1 (if redundant action)
373
- ```
374
-
375
- **Sparse terminal (on episode end):**
376
- ```
377
- +5.0 if all populations safe
378
- +0–2.0 efficiency bonus (faster = more)
379
- +1.0 briefing adherence bonus
380
- −3.0 × (pop_lost / total_pop) if population lost
381
- −2.0 if any crew casualty occurred
382
  ```
383
 
384
- Total range: **−8 to +8**. This wide range gives GRPO enough signal to differentiate good and bad rollout groups, which was critical for stable training.
385
-
386
- ### Training: GRPO with Curriculum Learning
387
-
388
- We trained Qwen-2.5-1.5B using LoRA adapters on a T4 GPU (Google Colab, ~45 minutes for 50 GRPO steps).
389
-
390
- The `CurriculumController` auto-promotes the agent across tiers based on a rolling 10-episode average reward:
391
- - **Easy** → promoted when mean reward > threshold
392
- - **Medium** → promoted when stable on medium
393
- - **Hard** → final evaluation tier
394
-
395
- Training stats show the agent consistently achieving rewards in the **{TBD}** range across all tiers, outperforming the random baseline and approaching the heuristic agent on Easy tier.
396
-
397
- ### Baseline Comparison
398
-
399
- We compare against two baselines:
400
-
401
- | Agent | Easy | Medium | Hard |
402
- |-------|------|--------|------|
403
- | **Random** | {TBD} | {TBD} | {TBD} |
404
- | **Heuristic** | {TBD} | {TBD} | {TBD} |
405
- | **Trained Qwen-2.5-1.5B** | {TBD} | {TBD} | {TBD} |
406
-
407
- The heuristic agent has hand-coded priority ordering (evacuate → protect population → air support → contain → recon → idle). Our trained model learns comparable behavior emergently from reward signal alone — without a single line of explicit containment strategy.
408
-
409
- ### Key Engineering Decisions
410
-
411
- **1. 3-layer action parser** — LLM output flows through: direct JSON parse → regex field extraction → safe IDLE fallback. The environment loop never breaks.
412
-
413
- **2. Autonomous crew behavior** — Crews aren't passive. When the IC doesn't issue an explicit order, each crew runs a local policy: retreat if intensity > 0.8, advance toward visible fire, else hold. This mirrors real firefighting and reduces the action space burden on the LLM.
414
-
415
- **3. Deterministic seeding** — `np.random.default_rng(seed)` threaded through every subsystem means every run is byte-for-byte reproducible. Crucial for fair benchmarking.
416
-
417
- **4. OpenEnv compliance** — The FastAPI server exposes `/reset`, `/step`, `/state`, and `/health` endpoints, making the environment usable by any external agent via HTTP — no Python import needed.
418
 
419
- ### What We Learned
420
 
421
- 1. **Reward decomposition matters more than model size** — A 1.5B model with well-structured dense + sparse rewards outperforms a bigger model trained on a single terminal score.
422
- 2. **Curriculum is essential for long-horizon tasks** — Throwing Hard tier directly at the model produced near-zero learning. Easy → Medium → Hard curriculum was the difference.
423
- 3. **Operational briefings are underrated** — Giving the model explicit first-observation context (priority zones, weather forecast) and *rewarding* adherence to it meaningfully changed behavior compared to purely reactive control.
 
12
  - openenv
13
  - wildfire
14
  - rl-environment
15
+ - long-horizon
16
+ - instruction-following
17
  ---
18
 
19
  # Wildfire Containment Simulator
20
 
21
+ **Meta OpenEnv Hackathon — Theme 2: Long-Horizon Planning & Instruction Following**
22
 
23
  ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
24
  ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
25
  ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
26
+ ![Python](https://img.shields.io/badge/Python-3.11+-blue)
27
+ ![License](https://img.shields.io/badge/License-MIT-green)
28
 
29
+ A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
30
 
31
+ > **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **{TBD}** on Hard tier vs. **+4.74** for the rule-based heuristic and **+2.16** for the random baseline.
32
+ > *(Numbers are filled in after `scripts/eval_trained_model.py` completes; see [Results](#results).)*
33
 
34
+ ---
35
+
36
+ ## 🔗 Quick Links
37
 
38
+ | Resource | Link |
39
+ |---|---|
40
+ | 🚀 **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
41
+ | 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
42
+ | 📒 **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
43
+ | 📒 **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
44
+ | 📝 **Long-form blog post** | [`BLOG.md`](BLOG.md) |
45
+ | 📊 **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
46
+ | 📈 **Training dashboard** | [`training/training_dashboard.png`](training/training_dashboard.png) *(generated post-run)* |
47
+ | 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
48
+ | 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
49
 
50
  ---
51
 
52
  ## Why Theme 2
53
 
54
+ | Pillar | How we model it |
55
+ |---|---|
56
+ | **Long-horizon planning** | Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
57
+ | **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
58
+ | **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
59
 
60
  ---
61
 
62
  ## Real-World Motivation
63
 
64
+ Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work — partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety — so an LLM can be trained, evaluated, and inspected on it end-to-end.
65
 
66
+ For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
67
 
68
  ---
69
 
70
+ ## Quickstart
71
 
72
  ```bash
73
+ # Clone and install
74
+ git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
75
+ cd Wildfire-Containment-Simulator
76
  uv pip install -r requirements.txt
77
  uv pip install -e .
78
 
79
+ # Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
80
  python scripts/evaluate.py 5
81
 
82
+ # Compare agents head-to-head
83
+ python scripts/eval_compare.py --seeds 42 43 44 45 46 \
84
+ --tiers easy medium hard --agents random heuristic
85
+
86
+ # Render an episode as a GIF
87
+ python scripts/replay.py --tier medium --seed 42 \
88
+ --agent heuristic --output demos/replay.gif
89
+
90
+ # Spin up the OpenEnv FastAPI server locally on port 7860
91
+ python server/app.py
92
+ # Then visit http://localhost:7860/ui/ for the interactive frontend
93
+ ```
94
+
95
+ Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
96
 
97
+ ---
98
+
99
+ ## Live Hugging Face Space
100
+
101
+ The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP — no Python import needed:
102
 
103
+ ```bash
104
+ SPACE=https://eshit-wildfire-containment-simulator.hf.space
105
 
106
+ curl "$SPACE/health"
107
+ curl -X POST "$SPACE/reset?task_id=easy&seed=42"
108
+ curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
109
+ -d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
110
  ```
111
 
112
+ Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
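The same calls can be made from Python. The following is a minimal sketch using only the standard library; the helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical, and it assumes the endpoints return JSON bodies, which the curl example above does not confirm.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the curl example above.
SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_request(task_id: str, seed: int, base_url: str = SPACE) -> urllib.request.Request:
    """Build POST /reset with task_id and seed as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return urllib.request.Request(f"{base_url}/reset?{query}", data=b"", method="POST")


def build_step_request(action: dict, base_url: str = SPACE) -> urllib.request.Request:
    """Build POST /step with the action serialized as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Network calls are kept behind the main guard so importing this file is side-effect free.
    print(send(build_reset_request("easy", 42)))
    print(send(build_step_request({
        "action_type": "deploy_crew",
        "crew_id": "crew_0",
        "target_row": 7,
        "target_col": 7,
    })))
```

Separating request construction from sending keeps the payload shape easy to inspect or unit-test without touching the live Space.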
113
+
114
  ---
115
 
116
  ## Environment API
 
119
  from env import WildfireEnv, Action, ActionType, Direction
120
 
121
  env = WildfireEnv()
122
+ obs = env.reset(task_id="easy", seed=42) # Observation (with OperationalBriefing on first step)
123
 
124
  while not env.done:
125
  action = Action(
 
127
  crew_id="crew_0",
128
  target_row=7, target_col=7,
129
  )
130
+ result = env.step(action) # StepResult
131
  obs = result.observation
132
+ reward = result.reward # decomposed float, range ~−8 to +8
133
  done = result.done
134
 
135
+ state = env.state() # Full ground truth (grading only)
136
  ```
137
 
138
+ `reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders — agents must work from `Observation`.
139
+
140
  ---
141
 
142
  ## Action Space
143
 
144
+ All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
145
 
146
+ | Action | Required parameters | Description |
147
+ |---|---|---|
148
+ | `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
149
+ | `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
150
+ | `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
151
+ | `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
152
+ | `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
153
+ | `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
154
+ | `idle` | `reason` *(optional)* | Explicitly wait |
155
+
156
+ A 3-layer parser (`env/action_parser.py`) maps raw LLM output → structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
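The three fallback layers can be illustrated with a short sketch. This is not the code in `env/action_parser.py`: it returns a plain dict rather than an `Action`, and the regex, the `KNOWN_ACTIONS` set, and the `parse_action` name are assumptions made for illustration.

```python
import json
import re

# Action names copied from the table above; the idle payload is an assumption.
KNOWN_ACTIONS = {
    "deploy_crew", "move_crew", "order_crew_objective",
    "drop_retardant", "build_firebreak", "recon_flight", "idle",
}
IDLE = {"action_type": "idle", "reason": "unparseable model output"}

# Matches loose `key: value` or `key=value` pairs, with or without quotes.
_FIELD_RE = re.compile(r'"?(\w+)"?\s*[:=]\s*"?([\w-]+)"?')


def parse_action(text: str) -> dict:
    # Layer 1: try to parse the outermost {...} span as JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            candidate = json.loads(text[start:end + 1])
            if isinstance(candidate, dict) and candidate.get("action_type") in KNOWN_ACTIONS:
                return candidate
        except json.JSONDecodeError:
            pass

    # Layer 2: recover individual fields with a regex.
    fields = {}
    for key, value in _FIELD_RE.findall(text):
        fields[key] = int(value) if re.fullmatch(r"-?\d+", value) else value
    if fields.get("action_type") in KNOWN_ACTIONS:
        return fields

    # Layer 3: nothing usable, so fall back to a safe idle action.
    return dict(IDLE)
```

In this sketch a truncated completion such as `{"action_type": "idle", "reason": ` still resolves through layer 2, and free text with no recognizable action falls through to idle.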
157
 
158
  ---
159
 
160
  ## Observation Space
161
 
162
+ | Component | Contents | Noise / occlusion |
163
+ |---|---|---|
164
+ | `briefing` | `OperationalBriefing` on first obs — incident ID, priority zones, infrastructure, wind forecast | First step only |
165
+ | `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
166
+ | `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
167
  | `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
168
+ | `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
169
  | `recent_events` | Last 5 notable events | Fully observable |
170
 
171
+ The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
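As an illustration of the clustering idea, the sketch below groups burning cells into connected regions with BFS and reports one bounding box per region. It is not the `serialize_observation()` implementation: the input format (a 2D grid of truthy/falsy values), the 8-connectivity, and the `cluster_fire_regions` name are assumptions.

```python
from collections import deque


def cluster_fire_regions(burning):
    """Return one (row_min, row_max, col_min, col_max, cell_count) tuple per fire region.

    `burning` is a 2D list where truthy entries mark burning cells.
    Cells are grouped with 8-connectivity (an assumption for this sketch).
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []

    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Start a BFS from each unvisited burning cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            r_min = r_max = r
            c_min = c_max = c
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                r_min, r_max = min(r_min, cr), max(r_max, cr)
                c_min, c_max = min(c_min, cc), max(c_max, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if ((dr or dc) and 0 <= nr < rows and 0 <= nc < cols
                                and burning[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            regions.append((r_min, r_max, c_min, c_max, count))
    return regions
```

Each tuple can then be rendered as one line of prompt text, so the prompt grows with the number of regions rather than the number of cells.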
172
+
173
  ---
174
 
175
  ## Reward Function
176
 
177
+ Decomposed for GRPO — wide reward range produces meaningful advantages between rollout groups.
178
 
179
  **Per-step (dense):**
180
+ ```
181
+ step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
182
  ```
183
 
184
+ **Terminal (sparse, on episode end):**
185
+ ```
186
  +5.0 if all populations safe
187
+ +0–2.0 efficiency bonus (faster containment → more)
188
+ +1.0 briefing-adherence bonus (all priority zones survived)
189
+ −3.0 · (pop_lost / total_pop) if any population lost
190
+ −2.0 if any crew casualty
191
+ −0.01 × invalid_action_count capped at −0.2
192
  ```
193
 
194
+ Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
195
+
196
+ | Tier | Spread scale | Episode length | Approx. reward ceiling |
197
+ |---|---|---|---|
198
+ | Easy | 1.00× | 80 | +8 |
199
+ | Medium | 0.70× | 150 | +7 |
200
+ | Hard | 0.55× | 300 | +6 |
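
The reward terms listed above can be combined as in the following sketch. It is not the code in `env/reward.py`: the `EpisodeOutcome` container, the `efficiency` input in [0, 1], and the choice to apply the efficiency bonus regardless of outcome are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Hypothetical end-of-episode summary used only for this sketch."""
    total_pop: int
    pop_lost: int
    efficiency: float              # assumed: 0.0 (slowest) to 1.0 (fastest containment)
    priority_zones_survived: bool
    crew_casualty: bool
    invalid_action_count: int


def step_reward(delta_containment: float, delta_pop_safety: float, redundant: bool) -> float:
    """Dense per-step term: weighted deltas minus a flat redundancy penalty."""
    return 0.4 * delta_containment + 0.4 * delta_pop_safety - (0.1 if redundant else 0.0)


def terminal_reward(outcome: EpisodeOutcome) -> float:
    """Sparse terminal term assembled from the bonuses and penalties listed above."""
    reward = 0.0
    if outcome.pop_lost == 0:
        reward += 5.0                                    # all populations safe
    else:
        reward -= 3.0 * (outcome.pop_lost / outcome.total_pop)
    reward += 2.0 * min(max(outcome.efficiency, 0.0), 1.0)  # efficiency bonus, 0 to 2
    if outcome.priority_zones_survived:
        reward += 1.0                                    # briefing adherence
    if outcome.crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * outcome.invalid_action_count, 0.2)  # penalty capped at 0.2
    return reward
```

Under these assumptions, a perfect episode (no loss, maximum efficiency, priority zones intact, no casualty, no invalid actions) yields a terminal reward of +8, which matches the stated ceiling for the Easy tier.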
201
 
202
  ---
203
 
204
  ## Three Difficulty Tiers
205
 
206
  ### Task 1 — Easy: Flatland Grass Fire
207
+ 15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
 
 
 
 
208
 
209
  ### Task 2 — Medium: Canyon Terrain with Wind Shifts
210
+ 25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
 
 
 
 
211
 
212
  ### Task 3 — Hard: Wildland-Urban Interface Crisis
213
+ 40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
 
 
 
 
214
 
215
  ---
216
 
217
  ## Fire Spread Model
218
 
219
+ A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
220
 
221
+ ```
222
+ P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
223
+ × (1 − moisture) × (1 − suppression) × tier_scale
224
  ```
225
 
226
+ | Factor | Effect |
227
+ |---|---|
228
+ | `base_rate` | Baseline spread by fuel type |
229
  | `fuel_factor` | Fuel load of the target cell |
230
+ | `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
231
+ | `slope_factor` | Faster uphill, slower downhill |
232
+ | `moisture` | Wet ground / recent rain reduces ignition probability |
233
+ | `suppression` | Crew presence and retardant coverage reduce spread |
234
+ | `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
235
+
236
+ Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
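
The ignition formula above translates directly into code. The sketch below is an illustration and not the contents of `env/fire_spread.py`: how each factor is derived from terrain and weather is left out, and the clamp to [0, 1] and the function names are assumptions.

```python
import random

# Tier scales as listed in the factor table.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(base_rate, fuel_factor, wind_factor, slope_factor,
                         moisture, suppression, tier):
    """Multiply the factors from the formula and clamp to a valid probability."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(max(p, 0.0), 1.0)  # clamp is an assumption; factors above 1 could overshoot


def attempt_ignition(p: float, rng: random.Random) -> bool:
    """Bernoulli draw: the neighbor ignites with probability p."""
    return rng.random() < p
```

Passing a seeded `random.Random` instance keeps the draw reproducible, in the same spirit as the seeded generator the project threads through its subsystems.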
237
 
238
  ---
239
 
240
+ ## Results
241
 
242
+ > Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15` on held-out seeds 200–214 (no overlap with training seeds 0–99).
243
 
244
+ | Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
245
+ |---|---|---|---|
246
+ | Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
247
+ | Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | +4.74 ± 3.79 |
248
+ | **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
249
+ | **Δ vs. Heuristic** | **{TBD}** | **{TBD}** | **{TBD}** |
250
 
251
+ **Auxiliary metrics for the trained agent** (filled in post-eval):
252
+
253
+ | Metric | Easy | Medium | Hard |
254
+ |---|---|---|---|
255
+ | JSON success rate | {TBD} | {TBD} | {TBD} |
256
+ | Mean population saved % | {TBD} | {TBD} | {TBD} |
257
+ | Crew casualty rate | {TBD} | {TBD} | {TBD} |
258
+
259
+ > See `scripts/trained_results.json` (post-eval) for the raw scores.
260
+
261
+ ---
262
+
263
+ ## Training
264
+
265
+ We use a two-stage recipe:
266
+
267
+ 1. **SFT warm-up** — generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
268
+ 2. **GRPO (TRL `GRPOTrainer`)** — start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
269
+
270
+ **Hardware:** A10G Large (24 GB) on a Hugging Face Space JupyterLab session.
271
+ **Training stack:** `unsloth` (4-bit QLoRA), `trl==0.15.2`, `datasets==3.4.1`, `transformers`, `peft`, `wandb`. Pinned in [`training/requirements.txt`](training/requirements.txt).
272
+
273
+ **Training plots:** dashboard PNG at [`training/training_dashboard.png`](training/training_dashboard.png) (4-panel: episode reward, population-survival rate, containment %, curriculum tier timeline). W&B run: *(link added post-run)*.
274
+
275
+ For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
276
 
277
  ---
278
 
 
281
  ```text
282
  Wildfire-Containment-Simulator/
283
  ├── env/
284
+ │ ├── wildfire_env.py # Main env: reset(), step(), state()
285
+ │ ├── models.py # Pydantic action/observation/state models
286
+ │ ├── grid.py # Terrain, smoke, moisture, fog-of-war
287
  │ ├── fire_spread.py # Cellular automaton fire propagation
288
  │ ├── weather.py # Stochastic weather engine
289
+ │ ├── resources.py # Crews, tankers, firebreaks, recon
290
  │ ├── reward.py # Decomposed step + terminal reward
291
  │ ├── briefing.py # OperationalBriefing generation
292
  │ ├── serialization.py # Observation → LLM prompt
293
  │ ├── action_parser.py # LLM output → Action (3-layer fallback)
294
+ │ ├── rendering.py # Frame rendering for GIF replays
295
+ │ └── curriculum.py # CurriculumController (auto-promote/demote)
296
  ├── agents/
297
  │ ├── random_agent.py
298
  │ └── heuristic_agent.py
299
  ├── graders/
300
+ │ ├── grader_easy.py # (total_reward, details_dict)
301
  │ ├── grader_medium.py
302
  │ └── grader_hard.py
303
  ├── scripts/
304
+ │ ├── evaluate.py # Baseline eval (random + heuristic)
305
+ │ ├── eval_compare.py # Multi-agent comparison
306
+ │ ├── eval_trained_model.py # Evaluate a trained adapter
307
+ │ ├── generate_sft_data.py # Build SFT dataset from heuristic rollouts
308
  │ ├── replay.py # Render episode as GIF
309
+ │ ├── run_demo.py # Pitch demo
310
+ │ └── plot_dashboard.py # 4-panel training curves
 
311
  ├── training/
312
+ │ ├── grpo_v2_colab.ipynb # GRPO notebook (canonical)
313
+ │ ├── sft_colab.ipynb # SFT warm-up notebook
314
+ │ ├── sft_data.jsonl # 4,300 SFT examples
315
+ │ ├── requirements.txt # Training deps (Unsloth, TRL, etc.)
316
  │ └── README.md
317
  ├── server/
318
+ │ └── app.py # FastAPI on port 7860
319
+ ├── frontend/ # Interactive HTML/JS frontend served at /ui/
320
+ ├── tests/ # 41 pytest tests
321
  ├── demos/ # GIF/PNG demo assets
322
+ ├── openenv.yaml # OpenEnv environment manifest
323
+ ├── Dockerfile # HF Space build
324
+ ├── BLOG.md # Long-form write-up
325
+ └── README.md # You are here
326
  ```
327
 
328
  ---
329
 
330
+ ## Architecture Decisions
331
 
332
+ 1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
333
+ 2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
334
+ 3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
335
+ 4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
336
+ 5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see [`BLOG.md`](BLOG.md) → "What broke").
337
+ 6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem; every run is byte-for-byte reproducible.
338
+ 7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent — TRL, vLLM, an OpenAI-compatible API client, a curl loop — can drive it.
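
To illustrate decision 1, here is a minimal sketch of group-relative advantage normalisation of the kind GRPO relies on, written in plain Python. The reward values are made up for illustration, and the `eps` term and function name are assumptions; TRL's actual implementation may differ in detail.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A flat reward gives every completion in the group the same score,
# so all advantages are zero and there is nothing to learn from.
flat = group_advantages([0.50, 0.50, 0.50, 0.50])

# A wide reward range (illustrative values) separates good and bad rollouts.
wide = group_advantages([6.2, -2.5, 5.8, -3.1])
```

With identical rewards the advantages vanish, which is the failure mode a wide, decomposed reward is meant to avoid.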
339
 
340
  ---
341
 
342
+ ## Citation
343
 
344
+ If you use this environment, please cite:
345
 
346
+ ```bibtex
347
+ @misc{wildfire-containment-simulator-2026,
348
+ title = {Wildfire Containment Simulator: Long-Horizon Planning and
349
+ Instruction Following for Disaster-Response LLM Agents},
350
+ author = {Team Wildfire},
351
+ year = {2026},
352
+ url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
353
+ note = {Meta OpenEnv Hackathon submission, Theme 2}
354
+ }
355
  ```
356
 
357
+ ---
358
 
359
+ ## License
360
 
361
+ [MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.
 
 
[External] Meta OpenEnv Hackathon Participant Help Guide.pdf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:eea09524b58bc396e97fb6b82d8e8da28df43fa0030f573470c4756973dbc197
3
- size 178344
 
 
 
 
prompts.md DELETED
@@ -1,644 +0,0 @@
1
- # Wildfire Containment Simulator — Agent Prompt Sequence
2
-
3
- **Usage:** Feed these prompts **one at a time** to your coding agent (Claude Code or Antigravity). After each prompt finishes, run its acceptance test yourself before moving to the next. Each prompt assumes the prior ones completed successfully.
4
-
5
- **Global context to paste once at the start of every new agent session** (if the agent loses context between prompts):
6
-
7
- > You are working on the Wildfire Containment Simulator — an OpenEnv-compatible RL environment for the Meta × PyTorch × HuggingFace OpenEnv Hackathon finale (April 25–26, 2026). The repo is at `https://github.com/Abrodolph/Wildfire-Containment-Simulator`. Core packages: `env/` (simulation), `agents/` (baselines), `graders/` (one per tier), `scripts/` (evaluation). The env exposes `reset()`, `step()`, `state()` with Pydantic-validated `Action`, `Observation`, `StepResult` models defined in `env/models.py`. Three tiers exist: easy (15×15), medium (25×25), hard (40×40). You can run `pytest`, `python scripts/evaluate.py`, and any other command. Iterate on failures until tests pass. Never skip the acceptance test at the end of each prompt.
8
-
9
- ---
10
-
11
- ## Prompt 1 — Repo Cleanup & Test Scaffolding ✅ DONE
12
-
13
- ```
14
- Clean up repo cruft and set up a test scaffold before we make any functional changes.
15
-
16
- Tasks:
17
- 1. Delete the nested `Wildfire-Containment-Simulator/` directory at repo root (leftover HF Space metadata).
18
- 2. Delete the literal `{env,graders,agents,scripts}` directory at repo root (shell-brace artifact).
19
- 3. Delete all committed `__pycache__/` directories and `*.egg-info/` folders.
20
- 4. Delete the `venv/` directory if it's committed.
21
- 5. Update `.gitignore` to include: `__pycache__/`, `*.egg-info/`, `venv/`, `.venv/`, `*.pyc`, `.pytest_cache/`, `.ruff_cache/`, `checkpoints/`, `results/`.
22
- 6. Consolidate server entry points: keep `server/app.py` as the single source of truth. Update the root `app.py` to be a one-line shim that imports and runs `server.app:main`. Update `Dockerfile` CMD to match.
23
- 7. Create `tests/` directory with `tests/__init__.py` and `tests/conftest.py`. In conftest, add a fixture `fresh_env` that yields a `WildfireEnv()` instance.
24
- 8. Create `tests/test_smoke.py` with three tests:
25
- - `test_env_resets_on_all_tiers` — calls `env.reset(task_id=t, seed=42)` for t in ["easy", "medium", "hard"] and asserts obs is not None.
26
- - `test_idle_action_never_crashes` — resets env, calls `env.step(Action(action_type=ActionType.IDLE))` 10 times, asserts no exception.
27
- - `test_determinism` — runs a fixed 20-step idle rollout twice with seed=42 on easy tier, asserts the final `stats.cells_burned` matches.
28
- 9. Add `pytest` and `pytest-cov` to `requirements.txt` if missing.
29
-
30
- Acceptance test:
31
- - `pytest tests/ -v` passes with 3 tests green.
32
- - `python app.py` still starts the server on port 7860.
33
- - `git status` shows no `__pycache__` or `{env,...}` cruft.
34
- - Output the diff summary of deleted files and new files.
35
- ```
36
-
37
- ---
38
-
39
- ## Prompt 2 — Reward Restructuring (Decomposed Terminal + Dense Step) ✅ DONE
40
-
41
- ```
42
- Replace the current normalized [0,1] composite reward with a decomposed terminal + dense step structure. This is critical for GRPO training — the current reward is too flat to produce meaningful advantages.
43
-
44
- Read `env/reward.py` first. Understand the current RewardCalculator class. Also read `env/wildfire_env.py` to see how reward is called per step.
45
-
46
- Tasks:
47
- 1. In `env/reward.py`, add a new method `compute_step_reward(prev_state, current_state, action_was_valid, action_was_redundant) -> float` that returns:
48
- - (delta_containment_pct * 0.4) + (delta_population_safety * 0.4) + (-0.1 if action_was_redundant else 0.0)
49
- - where delta_containment_pct is (current_containment - prev_containment) in [0, 1] units
50
- - delta_population_safety is (1 - current_pop_lost/total_pop) - (1 - prev_pop_lost/total_pop)
51
- - redundant = same action_type + same target coords as the immediately prior action
52
-
53
- 2. Add a method `compute_terminal_reward(final_state, episode_steps, max_steps) -> float`:
54
- - start at 0
55
- - if all_populations_safe (pop_lost == 0): add +5.0
56
- - else: add -3.0 * (pop_lost / total_pop)
57
- - if any crew_casualty occurred in the episode: add -2.0 (stacks with above)
58
- - efficiency_bonus = (max_steps - episode_steps) / max_steps * 2.0 — ONLY applied if pop_lost == 0
59
- - invalid_action_penalty_total = min(0.2, 0.01 * invalid_action_count) — subtract this
60
-
61
- 3. In `env/wildfire_env.py`:
62
- - Track `self._prev_action` and `self._invalid_action_count` and `self._crew_casualty_occurred` across the episode (reset them in `reset()`).
63
- - Replace the current reward computation in `step()` with: step_reward from above, plus terminal_reward ONLY when `done == True`.
64
- - The StepResult.reward should be `step_reward + (terminal if done else 0.0)`.
65
-
66
- 4. Keep the OLD composite reward accessible as `info["legacy_reward"]` in StepResult for backward compatibility with existing graders. (Graders get updated in Prompt 10.)
67
-
68
- 5. Add `tests/test_reward.py` with:
69
- - `test_successful_episode_scores_high` — run heuristic agent on easy tier seed=42, assert total reward > +3.0
70
- - `test_all_pop_lost_scores_negative` — construct a scenario (or mock state) where all population is lost, assert terminal < -2.0
71
- - `test_crew_casualty_stacks` — scenario with pop loss AND crew casualty, assert terminal includes both penalties
72
- - `test_redundant_action_penalty` — call the same DEPLOY_CREW twice, assert second call's step_reward includes -0.1
73
-
74
- Acceptance test:
75
- - `pytest tests/test_reward.py -v` passes all 4 tests.
76
- - Run `python scripts/evaluate.py 20` on easy tier with the heuristic agent. Report mean + std of total rewards. Successful episodes should cluster in the +5 to +8 range, failed episodes in the -2 to -5 range. If the ranges overlap by more than 20% of episodes, the reward isn't separated enough — report that and DO NOT proceed.
77
- ```
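
The reward formulas in Prompt 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `env/reward.py`: the `Snapshot` dataclass is a hypothetical stand-in for the env state, the crew-casualty flag and invalid-action count are passed in explicitly instead of being tracked on the env, and the `action_was_valid` parameter from the prompt is omitted because the step formula does not use it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical stand-in for the env state (not the real EnvState)."""
    containment: float  # fraction in [0, 1]
    pop_lost: int
    total_pop: int


def compute_step_reward(prev, cur, action_was_redundant):
    # Dense per-step signal: change in containment and in population safety.
    delta_containment = cur.containment - prev.containment
    prev_safety = 1 - prev.pop_lost / prev.total_pop
    cur_safety = 1 - cur.pop_lost / cur.total_pop
    penalty = -0.1 if action_was_redundant else 0.0
    return 0.4 * delta_containment + 0.4 * (cur_safety - prev_safety) + penalty


def compute_terminal_reward(final, episode_steps, max_steps,
                            crew_casualty, invalid_action_count):
    reward = 0.0
    if final.pop_lost == 0:
        reward += 5.0
        # Efficiency bonus applies only when nobody was lost.
        reward += (max_steps - episode_steps) / max_steps * 2.0
    else:
        reward -= 3.0 * final.pop_lost / final.total_pop
    if crew_casualty:
        reward -= 2.0  # stacks with the population term
    reward -= min(0.2, 0.01 * invalid_action_count)
    return reward
```

For example, a clean episode that finishes in 75 of 150 steps scores 5.0 + 1.0 = 6.0 at the terminal step, while losing the whole population with a crew casualty and many invalid actions scores -3.0 - 2.0 - 0.2 = -5.2.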
78
-
79
- ---
80
-
81
- ## Prompt 3 — Observation-to-Text Serializer ✅ DONE
82
-
83
- ```
84
- Write a serializer that converts a Pydantic Observation into a structured text prompt that an LLM can reason over. This is required because OpenEnv is an LLM-training framework — the agent is a language model, not a numeric policy.
85
-
86
- Read `env/models.py` to understand the Observation schema. Read the README section "Observation Space" for the intended structure.
87
-
88
- Tasks:
89
- 1. Create `env/serialization.py` with a function `serialize_observation(obs: Observation, step_num: int, max_steps: int) -> str`.
90
-
91
- 2. Output format (match this structure exactly — the LLM will be trained on it):
92
-
93
- ```
94
- === WILDFIRE INCIDENT COMMAND — STEP {step}/{max_steps} ===
95
-
96
- SITUATION:
97
- - Fire active on {N} cells. Containment: {pct}%. Population at risk: {N} zones.
98
- - Wind: {speed} km/h {dir} (±{noise} km/h noise). Humidity: {h}%. Rain: {active|inactive}.
99
- - Last event: {most_recent_event or "None"}
100
-
101
- GRID SUMMARY (smoke-obscured cells marked [?]):
102
- {bounding_box_descriptions_of_fire_regions}
103
- {populated_zone_descriptions}
104
- {firebreak_descriptions_if_any}
105
-
106
- RESOURCES:
107
- - crew_0: {deployed at (r,c) | undeployed available}. Status: {active|casualty}.
108
- - crew_1: ...
109
- - tanker_0: {ready | cooldown N steps remaining}
110
- - Firebreaks remaining: {N}. Recon flights remaining: {N}.
111
-
112
- RECENT EVENTS:
113
- - Step {N}: {event description}
114
- - ... (last 3 events max)
115
-
116
- Available actions: deploy_crew, move_crew, drop_retardant, build_firebreak, recon_flight, idle
117
- Produce your action as JSON: {"action_type": "...", ...}
118
- ```
119
-
120
- 3. Helper functions inside the module (keep private with leading underscore):
121
- - `_summarize_grid_regions(obs.grid) -> List[str]` — detect rectangular bounding boxes of (a) active fire cells clustered together, (b) populated cells, (c) built firebreaks. Output as "Row X-Y, Col A-B: description". Cap at 5 regions per category, prioritize by size.
122
- - `_format_resources(obs.resources) -> str`
123
- - `_format_events(obs.recent_events) -> str`
124
-
125
- 4. Add `tests/test_serialization.py`:
126
- - `test_serialize_produces_all_sections` — reset env, serialize, assert the output contains "SITUATION:", "GRID SUMMARY:", "RESOURCES:", "RECENT EVENTS:", "Available actions:".
127
- - `test_serialize_handles_fog_of_war` — hard tier reset, assert "[?]" appears somewhere in output (smoke or fog-obscured cells).
128
- - `test_serialize_length_under_2048_tokens` — run on all 3 tiers, assert `len(tokenizer.encode(output))` < 1800 using tiktoken's cl100k_base (if tiktoken not installed, use `len(text.split()) < 1500` as a proxy).
129
-
130
- Acceptance test:
131
- - `pytest tests/test_serialization.py -v` passes all 3 tests.
132
- - Run a manual sanity check: `python -c "from env import WildfireEnv; from env.serialization import serialize_observation; env = WildfireEnv(); obs = env.reset(task_id='medium', seed=42); print(serialize_observation(obs, 0, 150))"` — paste the output and confirm it reads like a realistic incident briefing.
133
- ```
134
-
135
- ---
136
-
137
- ## Prompt 4 — LLM Action Parser with 3-Layer Fallback ✅ DONE
138
-
139
- ```
140
- Build a robust parser that converts LLM text output into a validated Action object. LLMs produce malformed JSON, hallucinated fields, and out-of-range coords — we need to never crash.
141
-
142
- Tasks:
143
- 1. Create `env/action_parser.py` with a function `parse_action(llm_output: str, obs: Observation) -> Tuple[Action, str]` returning the action AND a status string ("json_success", "regex_fallback", "safe_idle").
144
-
145
- 2. Three layers, in order:
146
-
147
- LAYER 1 — Direct JSON parse:
148
- - Extract JSON from output using a helper `_extract_json_block(text)` that finds content between first `{` and matching `}` (handles ```json fences, handles leading/trailing text).
149
- - Try `json.loads` then `Action(**data)` — Pydantic validates fields.
150
- - On success return (action, "json_success").
151
-
152
- LAYER 2 — Regex extraction:
153
- - Search for action_type via regex: `action_type["\s:]+["']?(deploy_crew|move_crew|drop_retardant|build_firebreak|recon_flight|idle)`
154
- - Based on detected action_type, extract required fields with regex patterns (e.g., `crew_id["\s:]+["']?(crew_\d+)`, `target_row["\s:]+(\d+)`, `direction["\s:]+["']?(N|S|E|W|NE|NW|SE|SW)`).
155
- - Construct Action; if Pydantic validates, return (action, "regex_fallback").
156
-
157
- LAYER 3 — Safe fallback:
158
- - Return `(Action(action_type=ActionType.IDLE, reason="parse_failure"), "safe_idle")`.
159
-
160
- 3. Add coordinate sanity check: after any layer succeeds, if target_row or target_col is outside the current grid dimensions (infer from obs.grid shape), downgrade to safe_idle. Never trust LLM-provided coords blindly.
161
-
162
- 4. Add `tests/test_action_parser.py` with 8 test cases covering:
163
- - Clean JSON output
164
- - JSON wrapped in ```json fences
165
- - JSON with extra surrounding commentary
166
- - Malformed JSON (missing quotes) that regex can save
167
- - Completely garbage output → safe_idle
168
- - Out-of-bounds coords → safe_idle
169
- - Hallucinated action_type (e.g., "nuke_fire") → safe_idle
170
- - Empty string → safe_idle
171
-
172
- Acceptance test:
173
- - `pytest tests/test_action_parser.py -v` passes all 8 tests.
174
- - Zero crashes across the test suite.
175
- - Status string is correctly reported for each case.
176
- ```
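
A rough sketch of the three layers from Prompt 4, using only the standard library. It is not the repository's `env/action_parser.py`: it returns plain dicts instead of Pydantic `Action` objects, takes the grid size as `rows`/`cols` arguments instead of an `Observation`, only extracts integer coordinates in the regex layer, and the brace matcher does not account for braces inside JSON strings. All of those simplifications are assumptions made to keep the example self-contained.

```python
import json
import re

VALID_TYPES = {"deploy_crew", "move_crew", "drop_retardant",
               "build_firebreak", "recon_flight", "idle"}
_TYPE_RE = re.compile(
    r"action_type[\"'\s:]+(" + "|".join(sorted(VALID_TYPES)) + r")\b")


def _safe_idle():
    return {"action_type": "idle", "reason": "parse_failure"}, "safe_idle"


def _extract_json_block(text):
    """Return the first balanced {...} block, or None if there is none."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None


def _in_bounds(action, rows, cols):
    for key, limit in (("target_row", rows), ("target_col", cols)):
        value = action.get(key)
        if value is None:
            continue
        if not isinstance(value, int) or not 0 <= value < limit:
            return False
    return True


def parse_action(llm_output, rows, cols):
    # Layer 1: direct JSON parse of the first {...} block.
    block = _extract_json_block(llm_output)
    if block is not None:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and data.get("action_type") in VALID_TYPES:
            if _in_bounds(data, rows, cols):
                return data, "json_success"
            return _safe_idle()
    # Layer 2: regex extraction from loosely formatted text.
    match = _TYPE_RE.search(llm_output)
    if match:
        action = {"action_type": match.group(1)}
        for field in ("target_row", "target_col"):
            found = re.search(field + r"[\"'\s:]+(\d+)", llm_output)
            if found:
                action[field] = int(found.group(1))
        if _in_bounds(action, rows, cols):
            return action, "regex_fallback"
    # Layer 3: never crash, fall back to idle.
    return _safe_idle()
```

Out-of-bounds coordinates are downgraded to `safe_idle` in both parsing layers, matching the coordinate sanity check in task 3.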
177
-
178
- ---
179
-
180
- ## Prompt 5 — Replay / GIF Renderer ✅ DONE
181
-
182
- ```
183
- Build a replay script that renders any episode as an animated GIF. This is critical for the storytelling score — every demo asset depends on it.
184
-
185
- Tasks:
186
- 1. Add `imageio` and `matplotlib` to `requirements.txt` if not present.
187
-
188
- 2. Create `scripts/replay.py` with CLI: `python scripts/replay.py --tier {easy|medium|hard} --seed {int} --agent {random|heuristic} --output {path.gif}`.
189
-
190
- 3. The script should:
191
- - Instantiate the env, run the agent, capture the full ground-truth `env.state()` at every step.
192
- - For each step, render a matplotlib figure (8x8 inches, 100 dpi) with:
193
- * Main panel (80% area): grid colored by cell state. Burning = red (intensity → color saturation), burned = dark gray, populated = blue square outline, firebreak = brown, crew = green circle with crew_id label, tanker drop zone = translucent cyan overlay.
194
- * Bottom strip: step number, cells burning, containment %, pop lost, wind arrow + speed.
195
- - Save all frames, stitch to GIF at 5 fps, write to output path.
196
- - Also save final-frame PNG to same path with `.png` extension.
197
-
198
- 4. Keep the rendering code in `env/rendering.py` (importable helpers), not inline in the script. Functions:
199
- - `render_frame(state: EnvState, step: int, stats: dict) -> np.ndarray` — returns RGB array.
200
- - `render_episode_gif(frames: List[np.ndarray], output_path: str, fps: int = 5)`.
201
-
202
- 5. Add `tests/test_rendering.py`:
203
- - `test_render_frame_produces_rgb` — reset env on easy, render frame, assert shape is (H, W, 3) and dtype is uint8.
204
- - `test_gif_creation` — run 20 steps of random agent, call `render_episode_gif`, assert output file exists and is > 10KB.
205
-
206
- Acceptance test:
207
- - `pytest tests/test_rendering.py -v` passes both tests.
208
- - Run: `python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/heuristic_medium_42.gif`
209
- - Open the GIF. Confirm it shows fire spreading, crews moving, and the stats strip updating. Paste the final-frame stats as confirmation.
210
- ```
211
-
212
- ---
213
-
214
- ## Prompt 6 — Curriculum Controller ✅ DONE
215
-
216
- ```
217
- Add a curriculum controller that auto-promotes through tiers based on rolling performance. This produces the characteristic "dip-and-recover" pattern on training curves that makes for compelling demo visuals.
218
-
219
- Tasks:
220
- 1. Create `env/curriculum.py` with class `CurriculumController`:
221
- - `__init__(self, start_tier: str = "easy", thresholds: Optional[dict] = None)` — default thresholds: easy→medium at 4.0 avg over 10 eps, medium→hard at 3.5 avg over 10 eps (these are total episode rewards under the new reward scheme, NOT [0,1]).
222
- - `after_episode(self, total_reward: float) -> Optional[str]` — returns the new tier name if a promotion just fired, else None.
223
- - `get_tier(self) -> str` — current tier.
224
- - `get_history(self) -> List[Tuple[int, str, float]]` — list of (episode_idx, tier, reward) for plotting.
225
- - `promotion_log: List[Tuple[int, str]]` — list of (episode_idx, new_tier) for marking vertical lines on plots.
226
-
227
- 2. Demote behavior: if recent 10-ep avg drops below (threshold * 0.5) after a promotion, demote back. Log this too.
228
-
229
- 3. Add `tests/test_curriculum.py`:
230
- - `test_promotion_fires_at_threshold` — feed 10 rewards of 5.0, assert promotion to medium.
231
- - `test_no_premature_promotion` — feed 5 rewards of 5.0, assert still on easy.
232
- - `test_demotion_on_collapse` — promote to medium, then feed 10 rewards of 0.5, assert demoted to easy.
233
- - `test_history_tracking` — run 20 episodes, assert history length is 20 and promotion_log is correctly populated.
234
-
235
- Acceptance test:
236
- - `pytest tests/test_curriculum.py -v` passes all 4 tests.
237
- - The controller is not yet wired into the env itself (that happens in the training notebook, Prompt 8). This prompt just builds the component.
238
- ```
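
One way the promotion and demotion rules from Prompt 6 could look in code. This is a sketch, not the repository's `env/curriculum.py`. It assumes the rolling window is cleared on every tier change, that demotion compares against half of the threshold that was crossed to enter the current tier, and that demotions are recorded in a separate `demotion_log` list; none of these details are fixed by the prompt.

```python
from collections import deque

TIERS = ["easy", "medium", "hard"]


class CurriculumController:
    def __init__(self, start_tier="easy", thresholds=None, window=10):
        self.tier = start_tier
        # Threshold keyed by the tier being promoted *from*.
        self.thresholds = thresholds or {"easy": 4.0, "medium": 3.5}
        self.window = window
        self._recent = deque(maxlen=window)
        self.history = []        # (episode_idx, tier, reward)
        self.promotion_log = []  # (episode_idx, new_tier)
        self.demotion_log = []   # (episode_idx, new_tier), an assumption

    def get_tier(self):
        return self.tier

    def after_episode(self, total_reward):
        """Record one episode; return the new tier name if promoted, else None."""
        episode_idx = len(self.history)
        self.history.append((episode_idx, self.tier, total_reward))
        self._recent.append(total_reward)
        if len(self._recent) < self.window:
            return None  # not enough episodes at this tier yet
        avg = sum(self._recent) / self.window
        idx = TIERS.index(self.tier)
        promote_at = self.thresholds.get(self.tier)
        if promote_at is not None and avg >= promote_at:
            self.tier = TIERS[idx + 1]
            self.promotion_log.append((episode_idx, self.tier))
            self._recent.clear()
            return self.tier
        if idx > 0 and avg < 0.5 * self.thresholds[TIERS[idx - 1]]:
            self.tier = TIERS[idx - 1]
            self.demotion_log.append((episode_idx, self.tier))
            self._recent.clear()
        return None
```

Clearing the window on a tier change means a promotion or demotion always needs a full window of episodes at the new tier, which avoids flapping on stale rewards.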
239
-
240
- ---
241
-
242
- ## Prompt 7 — Eval Comparison Script ✅ DONE
243
-
244
- ```
245
- Build the eval comparison script that generates the headline comparison table for the pitch. This runs multiple agents on fixed seeds and outputs a clean comparison.
246
-
247
- Tasks:
248
- 1. Create `scripts/eval_compare.py` with CLI: `python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic base_llm trained_llm --output eval_results.json`.
249
-
250
- 2. Agent registry — a dict mapping agent name to a factory function:
251
- - `random` → existing RandomAgent
252
- - `heuristic` → existing HeuristicAgent
253
- - `base_llm` → LLM agent using `env/serialization.py` + `env/action_parser.py`, calling a base model (stub this for now — read model path from env var `BASE_MODEL_PATH`, default to None which skips this agent with a warning).
254
- - `trained_llm` → same pattern, env var `TRAINED_MODEL_PATH`.
255
-
256
- 3. For each (agent, tier, seed) combination:
257
- - Run the episode.
258
- - Record: final containment_pct, pop_saved_pct (= 1 - pop_lost/total_pop), total_reward, episode_steps.
259
-
260
- 4. Output:
261
- - A JSON file at the specified path with full results.
262
- - A printed table to stdout formatted like:
263
- ```
264
- === EVAL RESULTS — Medium Tier (5 seeds) ===
265
- Containment Pop Saved Reward Steps
266
- Random Agent 41% 60% -1.2 150
267
- Heuristic Agent 49% 71% +1.8 143
268
- Base LLM (Qwen) 38% 55% -0.9 150 [skipped — no model]
269
- Trained LLM (ours) 67% 89% +4.1 121 [skipped — no model]
270
- ```
271
- - Use mean across seeds for each column. Mark skipped agents clearly.
272
-
273
- 5. Add `--quick` flag that runs only easy tier with 2 seeds for smoke testing.
274
-
275
- 6. Add `tests/test_eval_compare.py`:
276
- - `test_quick_mode_runs` — invoke with --quick, assert eval_results.json exists, assert at least random and heuristic have non-null entries.
277
-
278
- Acceptance test:
279
- - `python scripts/eval_compare.py --quick` completes in under 2 minutes.
280
- - `pytest tests/test_eval_compare.py -v` passes.
281
- - Full run `python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic` produces the table. The LLM columns will show "[skipped — no model]" which is expected at this stage.
282
- ```
283
-
284
- ---
285
-
286
- ## Prompt 8 — GRPO Training Notebook (Colab) ✅ DONE
287
-
288
- ```
289
- Build the GRPO training notebook. This is a hackathon minimum requirement — without it we're technically DQ'd.
290
-
291
- Tasks:
292
- 1. Create `training/grpo_colab.ipynb` (a Jupyter notebook — JSON format). Use `nbformat` to construct it programmatically to avoid JSON escaping errors.
293
-
294
- 2. Notebook sections (each a separate cell with a markdown header cell above it):
295
-
296
- **Section 1: Setup**
297
- - pip install: `unsloth trl openenv-core pydantic numpy imageio matplotlib`
298
- - Clone the repo or install from path.
299
- - Import FastLanguageModel from unsloth, load `unsloth/Qwen2.5-1.5B-Instruct` in 4-bit with max_seq_length=2048.
300
- - Apply LoRA: r=16, alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"].
301
-
302
- **Section 2: Environment & Rollout**
303
- - Import WildfireEnv, serialize_observation, parse_action.
304
- - Define `collect_rollout(env, model, tokenizer, tier, seed) -> List[Dict]` that:
305
- * resets env
306
- * for each step: serializes obs → generates completion → parses action → steps env → records (prompt, completion, reward, step_status).
307
- * returns trajectory list.
308
- - Define `system_prompt` — a short, firm instruction to always output action as JSON only.
309
-
310
- **Section 3: GRPO Training Loop**
311
- - Use TRL's GRPOTrainer. Config: num_generations=8 per prompt, learning_rate=5e-6, max_steps=50, save_steps=10, per_device_train_batch_size=1, gradient_accumulation_steps=4.
312
- - Reward function: for each generation, run a mini-rollout (fresh env, same seed, sample action from the completion) and return the 1-step reward + discounted terminal if done. Cache seeds so generations for the same prompt see the same env state.
313
- - Wire in the CurriculumController from Prompt 6: after each full episode, call `controller.after_episode(total_reward)` and switch tier for the next episode.
314
-
315
- **Section 4: Checkpointing & Recovery**
316
- - Save LoRA adapter to `./checkpoints/step_{N}` every 10 steps.
317
- - Save a JSON of training stats (step, mean_reward, tier, parse_failure_rate) to `./training_stats.json` every step.
318
- - Add a "resume from checkpoint" cell at the top of Section 3 that loads the latest checkpoint if present.
319
-
320
- **Section 5: Plot Reward Curve**
321
- - Load training_stats.json, plot mean_reward vs step with matplotlib.
322
- - Save as `reward_curve.png`.
323
- - Mark tier promotions as vertical lines using controller.promotion_log.
324
-
325
- 3. Add `training/README.md` with:
326
- - How to open in Colab (a badge link).
327
- - Which cells to run in order.
328
- - Expected runtime on T4: ~45 min for 50 steps.
329
- - How to download the trained adapter.
330
-
331
- 4. Add `training/test_notebook_imports.py` — a plain Python file (not pytest) that imports every module the notebook uses and instantiates the env + tokenizer (skipping the model load). This catches broken imports before you open Colab.
332
-
333
- Acceptance test:
334
- - `python training/test_notebook_imports.py` runs without error.
335
- - Open the notebook locally with `jupyter nbconvert --to notebook --execute training/grpo_colab.ipynb --ExecutePreprocessor.timeout=600` — skip this if no GPU available locally (which is expected). Instead, validate notebook JSON with `jupyter nbconvert --to script training/grpo_colab.ipynb` and confirm the generated .py file has no syntax errors.
336
- - Confirm the notebook has exactly the 5 sections described, each with a markdown header.
337
- ```
338
-
339
- ---
340
-
341
- ## Prompt 9 — Training Curves Dashboard ✅ DONE
342
-
343
- ```
344
- Build a 4-panel training dashboard. Panel D (curriculum transitions) is the storytelling hook.
345
-
346
- Tasks:
347
- 1. Create `scripts/plot_dashboard.py` with CLI: `python scripts/plot_dashboard.py --stats training/training_stats.json --output training/training_dashboard.png`.
348
-
349
- 2. Layout: 2x2 matplotlib grid, figsize=(12, 8), dpi=100.
350
-
351
- - Panel A (top-left): Mean episode reward vs training step. Line plot with moving average (window=5) as a thicker overlay.
352
- - Panel B (top-right): Population survival rate (% of eps with zero pop loss) vs training step. Computed as rolling 10-ep fraction.
353
- - Panel C (bottom-left): Mean containment % at episode end, vs training step.
354
- - Panel D (bottom-right): Curriculum tier timeline. X-axis = episode index, Y-axis = tier (easy=0, medium=1, hard=2) drawn as a step function. Vertical dashed lines at promotion events with tier labels.
355
-
356
- 3. Handle missing data gracefully — if `training_stats.json` is absent, generate a synthetic stats file at `training/synthetic_stats_demo.json` with 50 fake training steps showing a plausible upward curve + one tier promotion, then plot from that (clearly label it "SYNTHETIC DEMO" in the figure title). This lets us test the plot script without a real training run.
357
-
358
- 4. Add `tests/test_dashboard.py`:
359
- - `test_synthetic_dashboard` — run `plot_dashboard.py` with no stats file, assert the synthetic PNG is created and > 50KB.
360
-
361
- Acceptance test:
362
- - `pytest tests/test_dashboard.py -v` passes.
363
- - Open the generated PNG. Confirm all 4 panels render, Panel D has visible vertical promotion lines, and the synthetic warning label is visible.
364
- ```
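
The smoothing used in Panels A and B can be sketched with one helper. This is an illustrative assumption about how the rolling values might be computed (a trailing window that is shorter at the start of the series), not the code in `scripts/plot_dashboard.py`; the `pop_lost_per_episode` values are made up.

```python
def rolling_mean(values, window):
    """Trailing mean over up to `window` most recent values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Panel B: fraction of recent episodes with zero population loss.
pop_lost_per_episode = [3, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0]  # illustrative
survived = [1.0 if lost == 0 else 0.0 for lost in pop_lost_per_episode]
survival_rate = rolling_mean(survived, window=10)
```

The same helper with `window=5` gives the moving-average overlay described for Panel A.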
365
-
366
- ---
367
-
368
- ## Prompt 10 — Grader Alignment & Legacy Reward Cleanup ✅ DONE
369
-
370
- ```
371
- The existing graders in graders/ were written against the old [0,1] composite reward. Align them with the new decomposed reward so eval numbers are consistent between training and grading.
372
-
373
- Tasks:
374
- 1. Read `graders/grader_easy.py`, `graders/grader_medium.py`, `graders/grader_hard.py`. Identify where each one reads `result.reward` or computes a final score.
375
-
376
- 2. Update each grader:
377
- - Sum step rewards + terminal reward across the episode using the new decomposed structure.
378
- - Return the total episode reward as the grader's score.
379
- - Add a `details` dict to the grader return value: `{"total_reward": float, "containment_pct": float, "pop_saved_pct": float, "steps": int, "crew_casualty": bool}`.
380
-
381
- 3. Remove any references to `legacy_reward` that are no longer needed. Keep `legacy_reward` in StepResult.info for one more cycle (delete later), but graders should NOT use it.
382
-
383
- 4. Update `scripts/evaluate.py` to print the new detailed metrics alongside the reward.
384
-
385
- 5. Update `README.md` "Baseline Scores" table with the new reward scale. Re-run `python scripts/evaluate.py 5` and paste the new numbers. Expected pattern: heuristic should now clearly beat random on ALL tiers under the new reward. If it doesn't, flag it — this is a diagnostic signal that the reward or the heuristic needs work.
386
-
387
- 6. Add `tests/test_graders.py`:
388
- - `test_each_grader_returns_float_and_details` — run each of the 3 graders with the heuristic agent, assert return structure.
389
- - `test_grader_scores_are_in_expected_range` — assert easy total_reward > 3.0 for heuristic, medium > 1.0, hard > 0.0 (generous lower bounds).
390
-
391
- Acceptance test:
392
- - `pytest tests/test_graders.py -v` passes.
393
- - `python scripts/evaluate.py 5` produces a table where heuristic beats random on every tier. Paste the output. If heuristic loses on any tier, investigate before proceeding — this likely indicates a variance or reward issue.
394
- - README "Baseline Scores" section is updated with new numbers.
395
- ```
396
-
397
- ---
398
-
399
- ## Prompt 11 — Demo Seed Finder + Demo Runner ✅ DONE
400
-
401
- ```
402
- Find a fixed seed that produces a clean, visually obvious contrast between heuristic and (eventually) trained-LLM behavior on medium tier. This becomes the 3-minute pitch demo.
403
-
404
- Tasks:
405
- 1. Create `scripts/find_demo_seed.py`:
406
- - Iterate seeds 0..500 on medium tier.
407
- - For each seed, run the HEURISTIC agent, record: total_reward, pop_saved_pct, wind_shift_step (if any), and whether at least one populated cell was lost.
408
- - Filter for seeds where: (a) a wind shift fires between step 60–90, (b) heuristic loses at least one populated cell, (c) heuristic total_reward is between 0.0 and +2.0 (i.e., a flawed but not catastrophic baseline, which gives room for improvement).
409
- - Output top 5 candidate seeds to `demos/candidate_seeds.json` with a short description of each.
410
-
411
- 2. Create `scripts/run_demo.py` with CLI: `python scripts/run_demo.py --seed {int}`:
412
- - Runs heuristic on medium tier with that seed, generates GIF to `demos/heuristic_demo.gif` using the Prompt 5 renderer.
413
- - Prints a play-by-play narrative: "Step 45: fire approaches populated cell (12, 8). Step 60: wind shifts. Step 75: crew committed to wrong flank. Step 89: populated cell burns."
414
- - If `--agent trained_llm` is passed and a TRAINED_MODEL_PATH env var exists, also runs the trained model and saves `demos/trained_demo.gif` + a second narrative.
415
- - Print a clean side-by-side comparison at the end: both agents' final stats.
416
-
417
- 3. Pick ONE seed from the top 5 as `DEMO_SEED`. Hardcode it as a constant in `scripts/run_demo.py`: `DEMO_SEED = <chosen_seed>`. The `--seed` flag defaults to this. Document the narrative for this specific seed in a comment block at the top of the file.
418
-
419
- 4. Add `demos/README.md` explaining how to regenerate demo assets.
420
-
421
- Acceptance test:
422
- - `python scripts/find_demo_seed.py` completes in under 10 minutes, outputs candidate_seeds.json.
423
- - `python scripts/run_demo.py` (with default seed) produces heuristic_demo.gif and prints a coherent narrative. Confirm the narrative matches what the GIF actually shows.
424
- - Paste the chosen DEMO_SEED value.
425
- ```
426
-
427
- ---
428
-
429
- ## Prompt 12 — Theme 2 Framing: Operational Briefing System
430
-
431
- ```
432
- Add a structured operational briefing that the env produces on reset(). The agent receives this as part of its first observation. This is what pivots the environment into Theme 2 (Long-Horizon Planning & Instruction Following) framing — judges need to see instruction-following as a first-class feature.
433
-
434
- Tasks:
435
- 1. Create `env/briefing.py` with:
436
- - Pydantic model `OperationalBriefing` with fields: `incident_id: str`, `ignition_cause: str`, `priority_populated_zones: List[Tuple[int, int]]` (cells the agent must prioritize protecting), `priority_infrastructure: List[Tuple[int, int]]` (e.g., road cells, optional), `forecast_events: List[str]` (e.g., "Wind shift southwest expected by step 60"), `declared_time: str` (narrative time like "04:00").
437
- - Function `generate_briefing(tier_config, rng) -> OperationalBriefing` that synthesizes a plausible briefing from the tier config. For populated priorities, pick the top 2 largest pop clusters. For forecast events, derive from the weather schedule if the engine exposes scheduled wind shifts; otherwise generate 1-2 plausible generic forecasts.
438
- - Function `briefing_to_text(briefing: OperationalBriefing) -> str` — formats as a natural-language briefing block:
439
- ```
440
- === OPERATIONAL BRIEFING ===
441
- Incident {incident_id} declared at {declared_time}.
442
- Cause: {ignition_cause}.
443
-
444
- PRIORITY 1: Protect populated zones at {coords list with cell names}.
445
- PRIORITY 2: Maintain {infrastructure} open where possible.
446
-
447
- FORECAST:
448
- - {forecast_1}
449
- - {forecast_2}
450
-
451
- Commander's intent: Contain fire with zero civilian casualties. Preserve crew safety.
452
- ```
453
-
454
- 2. Update `env/models.py`:
455
- - Add `briefing: Optional[OperationalBriefing]` field to `Observation`. Populated only on the first observation after reset; subsequent observations can reuse or omit.
456
-
457
- 3. Update `env/wildfire_env.py`:
458
- - On reset, generate a briefing and attach to the first observation.
459
- - Store `self.active_briefing` for the episode so reward logic can reference it.
460
-
461
- 4. Update `env/reward.py` compute_terminal_reward:
462
- - Add a `briefing_adherence_bonus`: +1.0 if all priority_populated_zones survived, 0 otherwise.
463
- - Stack this on top of the existing terminal reward.
464
-
465
- 5. Update `env/serialization.py` serialize_observation:
466
- - If `obs.briefing` is present, prepend `briefing_to_text(obs.briefing)` above the SITUATION block.
467
- - Subsequent steps: include a shortened reminder like "Priority zones: (r1,c1), (r2,c2) — still standing" or "— 1 LOST".
468
-
469
- 6. Add `tests/test_briefing.py`:
470
- - `test_briefing_generated_on_reset` — reset on medium, assert obs.briefing is not None and has ≥1 priority zone.
471
- - `test_briefing_adherence_bonus` — run heuristic successfully saving priority zones, assert terminal includes the +1.0.
472
- - `test_briefing_in_serialized_prompt` — serialize first obs, assert "OPERATIONAL BRIEFING" substring is present.
473
-
474
- Acceptance test:
475
- - `pytest tests/test_briefing.py -v` passes all 3 tests.
476
- - Run the serializer manually on a fresh medium reset and confirm the briefing reads coherently. Paste the output.
477
- - Re-run `python scripts/evaluate.py 5`. Reward numbers will shift slightly due to the new bonus — that's expected. Paste the new numbers.
478
- ```
479
-
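The Prompt 12 briefing model, text formatter, and adherence bonus can be sketched roughly as follows. To keep the snippet dependency-free it uses a stdlib dataclass in place of the Pydantic model the prompt calls for, so it is an approximation rather than the repository's `env/briefing.py`. The `burned_cells` argument and the choice to print bare coordinates (without cell names) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Cell = Tuple[int, int]


@dataclass
class OperationalBriefing:
    # Field names follow the prompt; a dataclass stands in for the Pydantic model.
    incident_id: str
    ignition_cause: str
    priority_populated_zones: List[Cell]
    priority_infrastructure: List[Cell] = field(default_factory=list)
    forecast_events: List[str] = field(default_factory=list)
    declared_time: str = "04:00"


def _coords(cells: List[Cell]) -> str:
    return ", ".join(f"({r},{c})" for r, c in cells)


def briefing_to_text(b: OperationalBriefing) -> str:
    lines = [
        "=== OPERATIONAL BRIEFING ===",
        f"Incident {b.incident_id} declared at {b.declared_time}.",
        f"Cause: {b.ignition_cause}.",
        "",
        f"PRIORITY 1: Protect populated zones at {_coords(b.priority_populated_zones)}.",
    ]
    if b.priority_infrastructure:  # optional per the prompt
        lines.append(
            f"PRIORITY 2: Maintain infrastructure at {_coords(b.priority_infrastructure)} open where possible."
        )
    if b.forecast_events:
        lines += ["", "FORECAST:"] + [f"- {e}" for e in b.forecast_events]
    lines += ["", "Commander's intent: Contain fire with zero civilian casualties. Preserve crew safety."]
    return "\n".join(lines)


def briefing_adherence_bonus(b: OperationalBriefing, burned_cells: Set[Cell]) -> float:
    # +1.0 only if every priority populated zone survived, 0 otherwise.
    return 1.0 if all(z not in burned_cells for z in b.priority_populated_zones) else 0.0
```

In the environment itself the bonus would be added on top of the existing terminal reward, with `burned_cells` derived from the final grid state.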
480
- ---
481
-
482
- ## Prompt 13 — README Rewrite for Finale Framing
483
-
484
- ```
485
- Rewrite the README to frame this as a finale submission aligned with Theme 2 (Long-Horizon Planning & Instruction Following). Keep all the technical depth but re-lead with the finale narrative.
486
-
487
- Tasks:
488
- 1. Replace the current README.md top section (above "Real-World Motivation") with:
489
-
490
- ```markdown
491
- # Wildfire Containment Simulator
492
-
493
- **OpenEnv Finale Submission — Theme 2: Long-Horizon Planning & Instruction Following**
494
-
495
- ![Training Demo](demos/heuristic_demo.gif)
496
-
497
- A partially-observable disaster simulation where an LLM acts as Incident Commander, interpreting operational briefings, tracking state across 300-step episodes, and recovering from cascading failures. Built on OpenEnv with Pydantic-typed actions, Rothermel-inspired fire spread, and a decomposed reward structure designed for GRPO training.
498
-
499
- **Headline result:** Our trained Qwen-2.5-1.5B IC achieves {X}% population survival on Hard tier vs. {Y}% for the rule-based heuristic baseline. See [HF blog post]({link}) for details.
500
-
501
- ## Quick Links
502
- - 🔥 **HF Space (live env):** {link}
503
- - 📒 **Training notebook (Colab):** [training/grpo_colab.ipynb]({link})
504
- - 📊 **Eval results:** [eval_results.json]({link})
505
- - 🎬 **Demo:** `python scripts/run_demo.py`
506
- - 📝 **Blog post:** {link}
507
- ```
508
-
509
- 2. Add a new section right after the quick links called **"Why Theme 2"**:
510
- - 3 bullets explaining long-horizon planning (300 steps, sparse terminal reward), instruction following (operational briefings), and recovery from early mistakes (staggered ignitions, crew loss events).
511
-
512
- 3. Keep all existing sections (Environment API, Action Space, Observation Space, Reward Function, Tiers, Fire Spread Model, Project Structure, Key Design Decisions).
513
-
514
- 4. Update the **Reward Function** section to describe the new decomposed structure (step rewards + terminal spikes), not the old [0,1] composite.
515
-
516
- 5. Add a new **"Baseline Scores"** table with post-training numbers. If training hasn't completed yet, use placeholder `{TBD}` and add a prominent note: "Numbers will be updated post-training on April 24."
517
-
518
- 6. Add a **"Reproducing Our Results"** section:
519
- - How to run baseline evals.
520
- - How to open the Colab notebook.
521
- - How to run the demo seed.
522
- - How to render replays.
523
-
524
- Acceptance test:
525
- - README renders cleanly on GitHub (preview via VSCode or `grip`).
526
- - All links are either live or clearly marked as placeholders.
527
- - The first screenful (hero + quick links + theme justification) is self-contained — a judge can get the pitch in 30 seconds without scrolling.
528
- ```
529
-
530
- ---
531
-
532
- ## Prompt 14 — CI & Final Repo Polish
533
-
534
- ```
535
- Add CI and final-mile polish. This is the "looks professional on GitHub" pass.
536
-
537
- Tasks:
538
- 1. Create `.github/workflows/ci.yml`:
539
- - Triggers: push to main, PRs.
540
- - Runs: setup Python 3.10, install requirements, run `pytest tests/ -v --cov=env --cov-report=term`.
541
- - Cache pip dependencies.
542
- - Required checks: all tests pass.
543
-
544
- 2. Add a coverage badge and CI badge to the top of README (below the title):
545
- ```
546
- ![CI](https://github.com/Abrodolph/Wildfire-Containment-Simulator/actions/workflows/ci.yml/badge.svg)
547
- ![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)
548
- ![Theme](https://img.shields.io/badge/Theme-2%20Long%20Horizon-orange)
549
- ```
550
-
551
- 3. Create `LICENSE` file with MIT license (to match the README frontmatter).
552
-
553
- 4. Audit `openenv.yaml` against the latest OpenEnv spec — fetch the latest spec from the openenv repo (github.com/meta-pytorch/openenv if that's the canonical URL at the time of writing) and verify field names, required properties, and schema version. Report any discrepancies and fix them.
554
-
555
- 5. Clean up `pyproject.toml`:
556
- - Pin Python to `>=3.10`.
557
- - Ensure all console_scripts point to existing entry points (no dead references).
558
- - Move `pytest`, `pytest-cov` to `[project.optional-dependencies]` under a `dev` extra.
559
-
560
- 6. Add `CONTRIBUTING.md` (brief — 15 lines is fine) explaining how to add a new tier, how to add a new action type, and where tests live.
561
-
562
- 7. Run `python -c "import env; import server; from env.wildfire_env import WildfireEnv; WildfireEnv().reset(task_id='easy', seed=0)"` as a final smoke test.
563
-
564
- Acceptance test:
565
- - CI badge appears (may show pending until the first push).
566
- - `pytest tests/ -v --cov=env` runs clean locally and reports >60% coverage on env/.
567
- - OpenEnv spec audit is completed — paste any discrepancies found and confirm they're fixed.
568
- - Repo root looks clean: no `__pycache__`, no `{env,...}` artifacts, no nested duplicate folder.
569
- ```
570
-
571
- ---
572
-
573
- ## Prompt 15 (OPTIONAL — Only if P1 complete by April 24 evening) — Multi-Agent Crew Architecture
574
-
575
- ```
576
- OPTIONAL: Only execute this if Prompts 1-14 are complete AND the training run has produced a working reward curve. Otherwise skip — this is a high-risk refactor close to deadline.
577
-
578
- Convert crews from passive tools into semi-autonomous sub-agents. This legitimizes the Halluminate sub-theme claim (Theme 1: Multi-Actor Environments) as a secondary pitch angle.
579
-
580
- Tasks:
581
- 1. Add `local_observation` method to Crew in `env/resources.py`:
582
- - Returns a 3×3 neighborhood view centered on the crew's position (fire_state, intensity, smoke), plus crew's own health state.
583
-
584
- 2. Add a `local_policy` function per crew:
585
- - Rule-based: if intensity at current cell > 0.8, retreat one cell away from fire center. Otherwise move toward nearest visible fire in the 3×3 window. If no fire visible, hold position.
586
- - Crews execute this policy automatically each step UNLESS the IC's most recent order overrides.
587
-
588
- 3. Change IC action space:
589
- - Keep existing `MOVE_CREW(crew_id, direction)` but re-label semantically as `ORDER_CREW_MOVE`.
590
- - Add `ORDER_CREW_OBJECTIVE(crew_id, objective: Literal["hold", "advance", "retreat", "prioritize_north", "prioritize_south", "prioritize_east", "prioritize_west"])` — the crew's local policy then biases toward that objective.
591
- - If the IC issues no order in a given step, crews follow their local_policy autonomously.
592
-
593
- 4. Reward impact:
594
- - Add tracking for "autonomous saves" — when a crew retreats on its own local_policy and avoids a casualty that would have otherwise occurred. Log these; they become a talking point ("our crews saved themselves 3 times in this episode without IC instruction").
595
-
596
- 5. Add `tests/test_multi_agent.py`:
597
- - `test_crew_retreats_from_high_intensity` — construct scenario with intensity spike at crew's cell, assert crew moves away next step even with no IC order.
598
- - `test_ic_order_overrides_local_policy` — assert `ORDER_CREW_MOVE` still works when issued.
599
- - `test_autonomous_save_tracking` — count autonomous_saves after a scripted scenario.
600
-
601
- 6. Update `env/serialization.py` to include crew local observations in the prompt under a new `CREW REPORTS` section (each crew reports what they see and what they're doing).
602
-
603
- 7. Update README to add a "Multi-Agent Architecture" section describing the IC/crew decomposition.
604
-
605
- Acceptance test:
606
- - `pytest tests/test_multi_agent.py -v` passes all 3 tests.
607
- - Run `python scripts/run_demo.py` — confirm the narrative now includes autonomous crew moments.
608
- - If ANYTHING breaks the existing test suite, revert the changes immediately. This prompt must not destabilize P1 deliverables.
609
- ```
610
-
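The rule-based crew policy from Prompt 15, task 2 might look something like the sketch below. Several details are assumptions, since the prompt leaves them open: the 3×3 windows are plain nested lists indexed with the crew at `[1][1]`, the "fire center" is supplied by the caller, "nearest" is measured by Manhattan distance with ties broken in scan order, and grid-boundary clipping is left to the environment.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (row delta, col delta), each in {-1, 0, 1}
HOLD: Move = (0, 0)


def _sign(x: int) -> int:
    return (x > 0) - (x < 0)


def local_policy(
    burning: List[List[bool]],      # 3x3 window, crew at [1][1]
    intensity: List[List[float]],   # 3x3 window, crew at [1][1]
    pos: Tuple[int, int],           # crew position on the full grid
    fire_center: Tuple[int, int],   # assumed to be provided by the env
) -> Move:
    # Rule 1: current cell too hot, so step one cell directly away from the fire center.
    # If the crew sits exactly on the fire center this degenerates to HOLD.
    if intensity[1][1] > 0.8:
        return (_sign(pos[0] - fire_center[0]), _sign(pos[1] - fire_center[1]))

    # Rule 2: move toward the nearest burning neighbor in the window.
    candidates = [
        (dr, dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != HOLD and burning[1 + dr][1 + dc]
    ]
    if candidates:
        return min(candidates, key=lambda m: abs(m[0]) + abs(m[1]))

    # Rule 3: no fire visible, hold position.
    return HOLD
```

An IC order would take precedence by skipping the call to `local_policy` for that crew on the step the order is issued.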
611
- ---
612
-
613
- ## Final Checklist (Run Before Submission)
614
-
615
- Run these commands sequentially. All must pass.
616
-
617
- ```bash
618
- # 1. All tests green
619
- pytest tests/ -v
620
-
621
- # 2. Baseline eval produces expected pattern
622
- python scripts/evaluate.py 5
623
-
624
- # 3. Eval comparison runs
625
- python scripts/eval_compare.py --seeds 42 43 44 45 46 --tiers medium hard --agents random heuristic
626
-
627
- # 4. Demo runs cleanly
628
- python scripts/run_demo.py
629
-
630
- # 5. Dashboard generates
631
- python scripts/plot_dashboard.py --stats training/training_stats.json --output training/training_dashboard.png
632
-
633
- # 6. Replay generates
634
- python scripts/replay.py --tier medium --seed 42 --agent heuristic --output demos/heuristic_medium_42.gif
635
-
636
- # 7. Notebook imports work
637
- python training/test_notebook_imports.py
638
-
639
- # 8. Env still serves
640
- python app.py &
641
- sleep 3
642
- curl http://localhost:7860/health
643
- kill %1
644
- ```