Eshit committed on
Commit
1da3c5b
·
1 Parent(s): 423d538

Rewrite BLOG.md as a personal narrative for hackathon submission

Files changed (1)
  1. BLOG.md +47 -192
BLOG.md CHANGED
@@ -1,246 +1,101 @@
1
- # Teaching a 1.5B-class Language Model to Fight Wildfires with GRPO
2
 
3
- *A frank write-up of what we built, what worked, and what broke — for the Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following.*
4
-
5
- > **TL;DR.** We built a partially-observable wildfire-response RL environment on OpenEnv, generated 4,300 supervised examples from a hand-coded heuristic, did a 1-epoch SFT warm-up on Qwen-2.5-7B-Instruct, then ran GRPO with a curriculum that auto-promotes the agent across three difficulty tiers. The trained agent reaches **{TBD}** mean reward on Hard tier (heuristic baseline: +4.74; random: +2.16). Code, env, training notebooks, and a live HF Space are all linked from the [`README`](README.md).
6
-
7
- ---
8
-
9
- ## Why wildfires?
10
-
11
- Most RL environments for language models are puzzles, games, or code tasks. We wanted something with three properties at once:
12
-
13
- 1. **Long-horizon, sparse-terminal reward.** A real plan has to survive 100+ steps before the result lands.
14
- 2. **Partial observability that *gets worse* during the episode.** Smoke spreads, recon expires, fog-of-war hides what hasn't been scouted recently.
15
- 3. **An explicit instruction-following channel.** A first-step "operational briefing" the agent must read, internalize, and adhere to — and a reward term that rewards adherence.
16
-
17
- Wildfire incident command hits all three. An incident commander gets a briefing, has hard resource limits (crews, air tankers, firebreak budget, recon), and has to balance speed vs. coverage vs. civilian safety while wind, slope, and humidity all change underneath them. We turned that into a structured grid environment with typed actions, an `OperationalBriefing` on reset, and a decomposed reward — and then trained an LLM to play the role of the IC.
18
 
19
  ---
20
 
21
- ## The environment, top down
22
-
23
- The environment is OpenEnv-compliant: `reset(task_id, seed) → Observation`, `step(Action) → StepResult`, `state() → dict`. Three difficulty tiers, all runnable on the same code path:
24
 
25
- ```
26
- Easy → 15×15 flat grid, 1 ignition, constant wind, 80 steps
27
- Medium → 25×25 canyon terrain, 2 ignitions, wind shifts, smoke, 150 steps
28
- Hard → 40×40 wildland-urban interface, staggered ignitions,
29
- fog-of-war, mid-episode crew casualty, 300 steps
30
- ```
31
 
32
- The agent never directly applies suppression. It positions resources (crews, tankers, firebreaks, recon flights), and the environment computes the resulting fire dynamics each tick. The 11-step tick pipeline is fully deterministic given a seed:
33
 
34
- ```
35
- validate(action) → execute(action) → spread_fire → apply_suppression
36
- → evolve_weather → update_moisture → propagate_smoke → tick_cooldowns
37
- → expire_recon → trigger_scripted_events → compute_reward → check_termination
38
- ```
39
 
40
- **Fire spreads via a Rothermel-inspired cellular automaton.** Every burning cell rolls against each of its 8 neighbors:
41
 
42
- ```
43
- P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
44
- × (1 − moisture) × (1 − suppression) × tier_scale
45
- ```
46
 
47
- Wind alignment dominates spread direction. Slope speeds uphill spread. Suppression from crew presence and tanker drops is *spatial* — it only affects the cells you've actually committed resources to.
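The spread rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the environment's actual code: the `SpreadFactors` container, its field ranges, and the final clamp into [0, 1] are hypothetical and only mirror the factor names in the formula.

```python
from dataclasses import dataclass


@dataclass
class SpreadFactors:
    """Per-neighbour multipliers; names mirror the formula, values are illustrative."""
    base_rate: float
    fuel_factor: float
    wind_factor: float
    slope_factor: float
    moisture: float      # assumed range: 0.0 (dry) to 1.0 (saturated)
    suppression: float   # assumed range: 0.0 (none) to 1.0 (fully suppressed)
    tier_scale: float = 1.0


def ignition_probability(f: SpreadFactors) -> float:
    """Multiply the factors as in the formula and clamp the product into [0, 1]."""
    p = (
        f.base_rate * f.fuel_factor * f.wind_factor * f.slope_factor
        * (1.0 - f.moisture) * (1.0 - f.suppression) * f.tier_scale
    )
    # The clamp is an assumption: strong wind and slope factors can push the
    # raw product above 1, which is not a valid probability.
    return min(1.0, max(0.0, p))
```

Because suppression enters as `(1 − suppression)`, a fully suppressed cell cannot ignite no matter how favourable the wind is, which is what makes resource placement spatial.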
48
 
49
  ---
50
 
51
- ## Speaking the agent's language
52
 
53
- A 7B chat model can't natively read a 40×40 grid of cell objects. So we built two adapters between the env and the LLM:
54
 
55
- **`serialize_observation()`** turns the raw `Observation` into a structured prompt:
56
- - BFS-clusters fire cells into bounding boxes ("3 BURNING clusters near rows 7–12, cols 3–8") so prompt length is `O(regions)` not `O(cells)`.
57
- - Lists resource state with cooldown warnings.
58
- - Surfaces the last 5 notable events.
59
- - Notes weather noise levels explicitly so the model knows readings are not exact.
60
 
61
- **`parse_action()`** is a 3-layer mapper from LLM output to `Action`:
62
- 1. Strip code fences, find a JSON object, parse it directly.
63
- 2. If JSON parsing fails: regex-extract `action_type` and per-action fields.
64
- 3. Final fallback: return a safe `IDLE`. The training loop never breaks on bad model output.
65
-
66
- That parser fallback is also a **defense against reward hacking** — there's no clever output that crashes the env or skips a step. Worst case the model burns a step on `IDLE` and pays the small step penalty.
67
 
68
  ---
69
 
70
- ## The reward, designed for GRPO
71
-
72
- GRPO computes advantages by comparing rollout rewards within a group of completions for the same prompt. If your reward signal is too narrow (e.g. all rewards in `[0, 1]`), the advantages collapse and the gradient washes out. We deliberately built a wide-range, decomposed reward.
73
 
74
- **Per-step (dense):**
75
- ```
76
- step_reward = 0.4·Δcontainment + 0.4·Δpop_safety − 0.1·redundant_action
77
- ```
78
 
79
- **Terminal (sparse, on episode end):**
80
- ```
81
- +5.0 if zero population lost
82
- +0–2.0 efficiency bonus (faster containment ⇒ more)
83
- +1.0 briefing-adherence bonus (all priority zones survived)
84
- −3.0 · (pop_lost / total_pop) if any population lost
85
- −2.0 if any crew casualty
86
- −0.01 × invalid_action_count capped at −0.2
87
- ```
88
 
89
- Empirical range: **−8 to +8**. That's a 16-point span, enough for clear advantages between rollout groups.
90
-
91
- **Two reward functions, not one.** For GRPO we register two reward functions with TRL:
92
- - `reward_fn_outcome` — the full episodic reward described above (computed by *running the full episode*, see "What broke" below).
93
- - `reward_fn_format` — a tiny standalone JSON-format check (`+0.15` for valid JSON with a recognized `action_type`, `0.0` for valid JSON with an unknown type, `−0.20` for unparseable garbage). This rewards good formatting independently from policy quality.
94
-
95
- This is the "multiple independent reward functions" pattern from the OpenEnv hackathon guide — and it cost us about 30 lines of code.
96
 
97
  ---
98
 
99
- ## Training, in two stages
100
-
101
- ### Stage 1 — SFT warm-up (~30 min)
102
-
103
- We harvested 4,300 `(prompt, action_json)` pairs from `HeuristicAgent` rollouts on successful episodes (filtered to `population_lost == 0`):
104
-
105
- | Tier | Examples |
106
- |---|---|
107
- | Easy | 2,000 |
108
- | Medium | 1,500 |
109
- | Hard | 800 |
110
 
111
- Then 1 epoch of SFT on Qwen-2.5-7B-Instruct via Unsloth 4-bit + LoRA (`r=32`, `α=64`, target modules: `q,k,v,o,gate,up,down`). The aim is **format priming**, not policy quality: we just want the model to reliably emit valid JSON `Action` objects so GRPO has something to optimize against. Going straight from base model to GRPO produced near-zero reward in our early experiments because most completions parsed as `IDLE`.
112
 
113
- ### Stage 2: GRPO with curriculum (~75 min)
114
-
115
- Starting from the SFT adapter, we run TRL's `GRPOTrainer` with 8 generations per prompt, `learning_rate=3e-6`, `max_completion_length=192`. The reward function is the key piece:
116
-
117
- ```python
118
- def reward_fn_outcome(completions, prompts, tier=None, seed=None, **kwargs):
119
-     rewards = []
120
-     for i, completion in enumerate(completions):
121
-         env = WildfireEnv()
122
-         # CRUCIAL: replay the EXACT (tier, seed) that produced this prompt
123
-         obs = env.reset(task_id=tier[i], seed=seed[i])
124
-         action, _ = parse_action(completion, obs)
125
-         result = env.step(action)
126
-         total = result.reward
127
-         # Heuristic carries the episode to completion so terminal reward fires
128
-         heuristic = HeuristicAgent()
129
-         while not env.done:
130
-             result = env.step(heuristic.act(env._current_obs))
131
-             total += result.reward
132
-         rewards.append(total)
133
-     return rewards
134
- ```
135
-
136
- The `CurriculumController` watches a rolling 10-batch reward average and promotes the dataset from easy → medium → hard. A `TrainerCallback` rebuilds the prompt dataset whenever a promotion fires, so prompts and reward states stay synchronized.
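A minimal sketch of how a rolling-average promotion rule like this could look. It is not the project's `CurriculumController`: the class name `RollingCurriculum`, the `record()` method, and the promotion threshold are assumptions made for illustration. Only `get_tier()`, the 10-batch window, and the easy → medium → hard order come from the text.

```python
from collections import deque

TIERS = ("easy", "medium", "hard")


class RollingCurriculum:
    """Hypothetical stand-in for the post's CurriculumController."""

    def __init__(self, window: int = 10, threshold: float = 5.0):
        self.window = window
        self.threshold = threshold  # assumed promotion bar, not taken from the post
        self.tier_index = 0
        self.recent = deque(maxlen=window)

    def get_tier(self) -> str:
        return TIERS[self.tier_index]

    def record(self, batch_mean_reward: float) -> bool:
        """Log one batch's mean reward; return True if this call promoted the tier."""
        self.recent.append(batch_mean_reward)
        window_full = len(self.recent) == self.window
        at_top = self.tier_index == len(TIERS) - 1
        mean_reward = sum(self.recent) / len(self.recent)
        if window_full and not at_top and mean_reward >= self.threshold:
            self.tier_index += 1
            self.recent.clear()  # start a fresh window on the new tier
            return True
        return False
```

A trainer callback would call `record()` once per batch and rebuild the prompt dataset whenever it returns `True`, so the prompts always come from the tier that `get_tier()` reports.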
137
 
138
  ---
139
 
140
- ## What broke (and what we fixed)
141
-
142
- We're including this section because we think the bugs are more interesting than the headline numbers.
143
-
144
- ### v1 GRPO bug #1 — Frozen dataset, live curriculum
145
 
146
- Our first GRPO run promoted the controller to medium at step 10 and to hard at step 20 — but the prompt dataset was built once before `trainer.train()` and *never refreshed*. So from step 10 onward the controller said "we're on hard" but the model was still being scored on easy-tier prompts. Training stats looked fine; the model wasn't actually learning the harder tasks.
147
 
148
- **Fix:** add a `TrainerCallback.on_step_end` that compares `controller.get_tier()` against the last seen tier and rebuilds the train dataset from scratch when they diverge.
149
 
150
- ### v1 GRPO bug #2: Truncated rollouts never saw terminal reward
151
 
152
- The first reward function ran for a fixed 15 steps, applying the LLM action at step 0 and the heuristic for 14 more steps. But hard tier has `min_active_steps=80`, so the +5.0 terminal reward never fired during training. GRPO advantages were dominated by ±0.5 per-step deltas, not the ±5 terminal spikes the reward was *designed* around.
153
 
154
- **Fix:** in v2, the reward function runs the full episode to `env.done`. This makes training slower but *the gradient signal is now comparable to baseline reward*.
155
-
156
- ### v1 GRPO bug #3 — Prompt/reward state mismatch
157
-
158
- The most insidious bug. The dataset's prompts were generated from `(tier, seed=fresh_random)`. The reward function then **picked a different random seed** to roll out against. So the model was being scored in a completely different env state than the one shown in its prompt. Imagine being asked "what would you do here?" while shown a photo of New York, and graded on what would have happened in Tokyo.
159
-
160
- **Fix:** every dataset row stores its `seed`. The reward function reads `seed` from `kwargs` (TRL passes dataset columns through as kwargs) and resets the env to that exact `(tier, seed)`. Prompt state and reward state are now identical.
161
-
162
- ### v1 GRPO bug #4 — Wasted inner generations
163
-
164
- The v1 reward function called `model.generate()` *seven extra times per completion* to build a multi-step rollout. But GRPO gradients only flow through the originally sampled completion — those 7 extra generations were expensive noise.
165
-
166
- **Fix:** `MODEL_STEPS = 1`. The model's sampled completion is applied as the step-0 action; the heuristic carries the rest. The wall-clock per training step dropped by ~70%.
167
-
168
- ### v1 GRPO bug #5 — Crash on format-only reward
169
-
170
- We tried to add a format-validity reward early on, but `parse_action(text, obs)` reads `obs.grid` to validate spatial fields. Calling it with `obs=None` for a pure format check crashed.
171
-
172
- **Fix:** a standalone `check_json_format(text)` function that doesn't need an obs. Three-state output (`json_success / regex_fallback / safe_idle`) → reward `(+0.15 / 0 / −0.20)`.
173
-
174
- We're being open about these bugs because we think *the post-mortem matters more than the leaderboard.* Anyone training GRPO on a custom OpenEnv environment is likely to hit at least three of these five.
175
 
176
  ---
177
 
178
- ## Results
179
 
180
- > Final numbers are produced by `python scripts/eval_trained_model.py --num-seeds 15 --tiers easy medium hard` on held-out seeds 200–214.
181
 
182
- | Agent | Easy | Medium | Hard |
183
- |---|---|---|---|
184
- | Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
185
- | Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
186
- | **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
187
 
188
- **JSON success rate (trained agent):** Easy {TBD}% · Medium {TBD}% · Hard {TBD}% (the SFT warm-up's job).
189
 
190
- **Population-saved %:** Easy {TBD}% · Medium {TBD}% · Hard {TBD}% (the headline safety metric).
191
 
192
- The training reward curve and tier-promotion timeline are in [`training/training_dashboard.png`](training/training_dashboard.png); the full W&B run is at *(link added post-run)*.
193
-
194
- ### What the trained agent learned (qualitatively)
195
-
196
- *Filled in post-run from inspection of `training/samples/call_*.txt`:*
197
-
198
- - {TBD: behavioral pattern 1, e.g. "tends to drop retardant ahead of wind direction rather than reactively"}
199
- - {TBD: behavioral pattern 2, e.g. "deploys crews around priority zones first, even when fire is closer to lower-priority cells"}
200
- - {TBD: behavioral pattern 3, e.g. "saves recon for mid-episode after staggered ignition fires"}
201
-
202
- ---
203
-
204
- ## Key learnings
205
-
206
- 1. **Reward decomposition matters more than model size.** A wide, structured reward gave a 7B model enough signal to surpass random and approach the heuristic on medium. We expect a 1.5B model would also work — the bottleneck is reward design, not parameters.
207
- 2. **Curriculum is essential for long-horizon tasks.** Throwing hard tier directly at the SFT model produced near-zero gradient signal — the +5 terminal bonus was almost never observed. Easy → medium → hard with auto-promotion was the difference.
208
- 3. **Format compliance must be a first-class reward, not an afterthought.** The format-only reward function (`+0.15 / 0 / −0.20`) cost us 30 lines and meaningfully reduced parse-failure rate during training. It also makes the JSON success rate trackable as an independent metric.
209
- 4. **Replay the prompt's exact env state when scoring completions.** Stochastic env resets in your reward function turn GRPO into "what's a good action *somewhere*?" instead of "what's a good action *here*?". The latter is what you actually want.
210
- 5. **Heuristic continuation is a powerful variance-reduction trick.** Letting the heuristic finish each rollout reduces noise from the model's later (uncertain) actions, so the gradient signal mostly reflects the *first* action's quality. Combined with full-episode rollout, you get terminal reward without 300 model.generate() calls per training step.
211
- 6. **Inspect generations on disk every N steps.** TRL's stdout logging shows you `mean_reward` only. Saving the first completion of each batch to `training/samples/call_{n}.txt` is what catches reward hacking and format regressions before they become catastrophic.
212
 
213
  ---
214
 
215
- ## Limitations and future work
216
-
217
- - **Heuristic continuation is a double-edged sword.** It reduces variance, but the reward attributes a good outcome to the model's first action even when the heuristic deserves most of the credit. A planned ablation: train one model with heuristic continuation and one with full-model rollout, compare on held-out seeds.
218
- - **Hard tier still has high variance.** Heuristic std on hard is ±3.79 — bimodal between full saves and total losses. Smoothing the ignition-spawn distribution (`_find_ignition_candidate` in `wildfire_env.py`) would reduce this.
219
- - **Single-tenant FastAPI server.** The HF Space currently uses a module-level `_env` singleton. Two concurrent users would clobber each other's episode. Per-session env binding via cookie/header is a 30-line fix we deferred.
220
- - **Held-out generalization untested at scale.** We evaluate on seeds 200–214 (15 per tier) which don't appear in the 0–99 training pool. A larger holdout (say 200–999) would tighten the confidence intervals.
221
- - **No multi-agent coordination experiments yet.** Each crew already runs a local autonomous policy; an obvious next step is to also let multiple LLM ICs collaborate on a shared incident.
222
-
223
- ---
224
 
225
- ## Acknowledgments
226
 
227
- - **Meta** and **Hugging Face** for the OpenEnv hackathon, the OpenEnv spec, and Hugging Face Spaces.
228
- - **Scaler** for being an amazing host; we had great fun interacting with participants from various parts of the country and from all walks of life.
229
- - **Unsloth** for fast 4-bit LoRA training on consumer/colab GPUs.
230
- - **The TRL team** for `GRPOTrainer`, especially the multi-reward-function support.
231
- - The Rothermel surface-fire spread model, which has shaped wildfire science since 1972; even our toy version owes its structure to that work.
232
 
233
  ---
234
 
235
  ## Links
236
 
237
- - 🚀 Live env on Hugging Face: [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
238
- - 💻 Source on GitHub: [`Abrodolph/Wildfire-Containment-Simulator`](https://github.com/Abrodolph/Wildfire-Containment-Simulator)
239
- - 📒 GRPO notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb)
240
- - 📒 SFT notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb)
241
- - 📊 Baselines: [`scripts/results.json`](scripts/results.json)
242
- - 📈 Training dashboard: [`training/training_dashboard.png`](training/training_dashboard.png)
243
- - 🎬 Heuristic replay: [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif)
244
- - 📄 Top-level overview: [`README.md`](README.md)
245
 
246
- *— Team Wildfire, April 2026*
 
1
+ # The Long Walk to an Incident Commander
2
 
3
+ *How a stray clip about the LA fires turned into my first reinforcement learning project.*
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
 
5
  ---
6
 
7
+ ## The empty shortlist
 
 
8
 
9
+ When the OpenEnv hackathon was announced, I did not have an idea.
 
 
 
 
 
10
 
11
+ I had a shortlist, sure, the kind every hobbyist machine-learning person carries around in the back of their head. A code-review agent. A structured-output LLM judge. Something to do with compliance reports, an idea I abandoned within a day because the moment I tried to write a one-paragraph pitch I started yawning. There is a useful diagnostic in that, I think. If your own pitch bores you in draft, it will bore everybody else louder.
12
 
13
+ So I scrolled. Embarrassingly enough, that is where the project really starts.
 
 
 
 
14
 
15
+ It was around one in the morning, and I was doom-scrolling YouTube shorts in the kind of fugue where individual videos stop registering and become a slurry of noise. An NBC Bay Area clip drifted past. Then aerial footage of the LA fires, hills the colour of rust, voice-overs droning numbers about acres and containment percentages. What I remember is the way one reporter said, almost in passing, *"the incident commander has decided to pull crews back"*. A single human being, deciding at three in the morning their time, where to put the people, where to drop the retardant, which neighbourhoods to write off. I closed the app, went to bed, and didn't think about it again for a couple of days.
16
 
17
+ But it kept coming back. Quietly, at first, while I was supposed to be reading the hackathon brief. The themes mapped almost too cleanly onto the shape of the thing that incident commander on the clip was doing: incomplete information, hard resource limits, weather you cannot argue with, civilians you cannot disappoint. I scribbled *"Wildfire Incident Commander as an RL task"* on a sticky note and pressed it to the side of my monitor. The note is still there, slightly buckled at the corner now, with coffee on it.
 
 
 
18
 
19
+ There was one problem with this plan. I had never actually trained a reinforcement learning policy.
20
 
21
  ---
22
 
23
+ ## How I got into RL in the first place
24
 
25
+ My exposure to RL up to this point was almost entirely cinematic.
26
 
27
+ I had grown up on those grainy DeepMind Atari videos, the ones where a tiny green paddle slowly figures out it can tunnel through the side of a Breakout wall and bounce the ball around behind it. I remember rewatching that clip on loop the first time I saw it and feeling something genuinely uncanny. The agent had not been told about the tunnel. Nobody coded the tunnel. It just appeared, somewhere in the loss landscape, as the cheapest way to keep the ball alive.
 
 
 
 
28
 
29
+ From there it was the usual pipeline. The *AlphaGo* documentary, late one weekend. OpenAI's hide-and-seek video where the agents start surfing on boxes their opponents are still trying to lock down. Two Minute Papers explaining AlphaStar and OpenAI Five with that delighted Hungarian cadence. I read Sutton and Barto in the way most people read Sutton and Barto, which is to say, three chapters in great detail and the rest in spirit. I read the Mnih DQN paper, the Schulman PPO paper, eventually the DeepSeek-R1 work and the GRPO derivations, and I poked at a couple of CartPole notebooks. But I had never actually trained a policy that mattered. RL had this folkloric reputation around it, finicky, expensive, vibes-based, the part of the deep-learning toolbox most likely to silently fail in interesting ways. I had a healthy fear of the field.
 
 
 
 
 
30
 
31
  ---
32
 
33
+ ## The confession
 
 
34
 
35
+ So the confession I should make early is that this was my first real reinforcement learning project. OpenEnv was even newer to me. I came in cold.
 
 
 
36
 
37
+ What kept me from bailing in the first three days was, paradoxically, exactly that newness. I was already accepting I would be uncomfortable. I figured I might as well be uncomfortable about something that genuinely interested me. The Dunning-Kruger trough was waiting for me regardless. Better to fall into it doing something I would be proud to talk about afterwards.
 
 
 
 
 
 
 
 
38
 
39
+ So I started reading.
 
 
 
 
 
 
40
 
41
  ---
42
 
43
+ ## The Rothermel rabbit hole
 
 
 
 
 
 
 
 
 
 
44
 
45
+ For the unfamiliar: Richard Rothermel published the canonical surface-fire spread model in 1972, in a US Forest Service technical report titled, with monastic plainness, *"A Mathematical Model for Predicting Fire Spread in Wildland Fuels"*. It is roughly thirty pages, and it underwrites almost every operational fire-behaviour predictor used in the field since. BehavePlus, FARSITE, FlamMap, different tools, different abstractions, the same skeleton. I downloaded the PDF at four in the afternoon and was still reading it at midnight. I was unprepared for how much *taste* there was in those equations, the way Rothermel had to balance theoretical fidelity against parameters a real-world ranger could plausibly measure with a pole and a moisture probe. There is a particular kind of engineering elegance in a model that survives that long under field conditions.
46
 
47
+ What surprised me more, and what tipped this project from "interesting" to "I have to do this", was how thin the work at the *intersection* of language models, reinforcement learning, and wildfire response was. There are RL papers on fire suppression (Subramanian and Crowley's forest-fire DRL work, Julian and Kochenderfer on aircraft routing for wildfire surveillance, the FireCommander multi-UAV environment). There are LLM-as-agent papers on disaster response. There is an entire operations-research literature on resource allocation in incident command. But the middle of that Venn diagram, where an LLM is the actor inside an RL loop on a Rothermel-style spread environment, was almost empty. For a solo entrant that is a gift. Judges will see ten polished agentic-coding projects for every one project that even *attempts* something this far off the beaten path. I would rather be the rough sketch of something unusual than the tenth-best version of something normal.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
  ---
50
 
51
+ ## The metamorphosis
 
 
 
 
52
 
53
+ My first design was nothing like what I ended up shipping.
54
 
55
+ The initial prototype was a single-step decision agent. The model would receive a snapshot of the fire (a textual map, a brief weather summary, a list of crews) and produce *one* action. The environment would run a thirty-step simulation under a fixed policy, then report back a number. I would train against that number. Clean, simple, tractable. I built it out over two evenings.
56
 
57
+ It was bad. Not in the "the loss is high" way, but in the "this isn't actually the problem I want to solve" way. A single decision under a thirty-step rollout teaches the model almost nothing about *sequencing*. It teaches it to be a one-shot triagist, a useful skill in narrow contexts, but not what an incident commander actually does. An IC's job is to keep deciding as the situation degrades, to revise, to recover when a crew gets hurt and the wind shifts and the second ignition happens behind them. None of that was in v1.
58
 
59
+ The metamorphosis happened gradually. I added a step penalty so the model could not loiter. I added a terminal reward for population saved, and immediately the gradient washed out because the terminal reward was rare and small relative to per-step noise. I scaled the terminal reward up; the model learnt to game the per-step component instead. I added a *briefing*, a written paragraph describing priority zones, infrastructure, and the wind forecast, partly because real ICs read briefings, and partly because it gave me an honest measurement of instruction following. I added a curriculum because the hard tier, on its own, produced a flat reward curve and a sad-looking W&B chart. I added a second reward function for JSON validity because GRPO begs for it. I added a heuristic continuation step because I was burning GPU minutes making the model generate seven extra times per prompt, when the gradient only flowed through the *first* generation anyway.
60
 
61
+ Each of those decisions came from running the thing and watching it fail in a specific, legible way. The system I ended up with is not the system I designed. It is the system that survived.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
62
 
63
  ---
64
 
65
+ ## What I want the judges to take away
66
 
67
+ There were nights, there are always nights, where the project felt absurd. A solo participant. First RL project. A custom environment. A custom reward decomposition. SFT, then GRPO, on a 7B model, on Colab, with a curriculum controller and a callback that rebuilds the dataset mid-run. I made all the canonical mistakes, including, briefly, training on a dataset whose seeds did not match the seeds the reward function rolled out against, which is the GRPO equivalent of grading a student's geography exam by asking them about a different country. I found it the way you always find these things, by squinting at a sample completion at 2 a.m. and thinking *wait, that does not match*.
68
 
69
+ But I learned more in three weeks than I would have from a semester of well-mannered tutorials. Reinforcement learning, as a sub-field, is unforgiving in a productive way. It does not let you confuse "the loss is going down" with "the model is doing the thing". You have to look. You have to read rollouts. You have to talk to your model the way an IC talks to a crew chief, patiently, specifically, willing to be told the plan is wrong.
 
 
 
 
70
 
71
+ If a judge reads this far, I want to be candid about what I think the contribution actually is. The trained model does not beat the heuristic on hard tier. It approaches it on medium (+5.74 vs. +6.31) and falls short on hard, and I would rather submit honest numbers than goose them. The headline is not the leaderboard; it is the *artifact*: a typed, OpenEnv-compliant environment with a Rothermel-flavoured spread model, a decomposed reward built for GRPO, a serialiser that keeps prompt length sub-linear in grid size, a parser that refuses to crash on malformed completions, and an end-to-end training recipe somebody else can pick up tomorrow and improve on. Plus a frank post-mortem of every bug I hit along the way, which is, I suspect, more useful to the next person trying this than another tenth of a reward point would be.
72
 
73
+ I started this project because of a one-minute clip about somebody else's bad night. I am finishing it knowing more about reinforcement learning, more about wildland fire science, and a little more about my own tolerance for ambiguity than I did three weeks ago.
74
 
75
+ The sticky note is staying on the monitor.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
  ---
78
 
79
+ ## The five bugs that taught me the most
 
 
 
 
 
 
 
 
80
 
81
+ A short post-mortem, because the bugs are the part of the story you actually learn from.
82
 
83
+ 1. **Frozen dataset, live curriculum.** The controller promoted to *medium* at step 10 and *hard* at step 20, but the prompt dataset was built once before training and never refreshed. The model was happily being scored on easy prompts while the dashboard insisted it was on hard. Fixed with a `TrainerCallback` that rebuilds the dataset on tier change.
84
+ 2. **Truncated rollouts never saw terminal reward.** v1 ran a fixed 15-step rollout per completion. Hard tier needs at least 80 steps before the +5 survival bonus can fire, so GRPO was optimising against per-step deltas only. v2 runs to `env.done`. Twice as slow, gradient signal night-and-day better.
85
+ 3. **Prompt and reward state mismatch.** Each dataset row was generated with a fresh random seed, and the reward function picked *another* fresh seed at scoring time. The model was being graded on a different fire than the one in its prompt. Now every row carries its `seed`, and the reward function resets to that exact `(tier, seed)`.
86
+ 4. **Wasted inner generations.** v1 called `model.generate()` seven extra times per completion to build a multi-step rollout, but GRPO gradients only flow through the originally sampled completion. Those seven calls were expensive noise. Cutting `MODEL_STEPS` to 1 and letting the heuristic finish the episode dropped wall-clock per step by about 70%.
87
+ 5. **Format reward crashing on `obs=None`.** The action parser reads `obs.grid` to validate spatial fields, so calling it for a pure JSON-validity check crashed. A standalone `check_json_format()` that does not need an obs solved it in twenty lines.
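The fix for bug 5 can be illustrated with a small, observation-free checker. This is a sketch under assumptions rather than the repository's `check_json_format()`: the reward values (+0.15 for valid JSON with a recognised `action_type`, 0 for valid JSON with an unknown type, −0.20 for unparseable text) follow the earlier version of this post, and every action name other than `IDLE` is hypothetical.

```python
import json

# Hypothetical action names (only IDLE appears in the post); the real
# environment's action set may differ.
KNOWN_ACTION_TYPES = {"IDLE", "DEPLOY_CREW", "DROP_RETARDANT", "BUILD_FIREBREAK", "RECON"}


def check_json_format(text: str) -> float:
    """Score only the shape of a completion; no observation is needed."""
    # Take the outermost {...} span so code fences or chatter around the
    # JSON object do not matter.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return -0.20
    try:
        payload = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return -0.20
    if not isinstance(payload, dict):
        return -0.20
    action_type = payload.get("action_type")
    if isinstance(action_type, str) and action_type in KNOWN_ACTION_TYPES:
        return 0.15
    return 0.0
```

Because the function never touches `obs.grid`, it can be registered as a separate reward function and called on raw completions without resetting an environment first.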
88
 
89
  ---
90
 
91
  ## Links
92
 
93
+ - Live environment on Hugging Face: [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
94
+ - Source: [`Abrodolph/Wildfire-Containment-Simulator`](https://github.com/Abrodolph/Wildfire-Containment-Simulator)
95
+ - Trained model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b)
96
+ - W&B run: [`saini-eshit-/wildfire-grpo/runs/dnz56kuu`](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu)
97
+ - GRPO notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb)
98
+ - SFT notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb)
99
+ - Top-level overview: [`README.md`](README.md)
 
100
 
101
+ *Eshit, April 2026*