Rewrite BLOG.md as a personal narrative for hackathon submission
BLOG.md
CHANGED
|
@@ -1,246 +1,101 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
*
|
| 4 |
-
|
| 5 |
-
> **TL;DR.** We built a partially-observable wildfire-response RL environment on OpenEnv, generated 4,300 supervised examples from a hand-coded heuristic, did a 1-epoch SFT warm-up on Qwen-2.5-7B-Instruct, then ran GRPO with a curriculum that auto-promotes the agent across three difficulty tiers. The trained agent reaches **{TBD}** mean reward on Hard tier (heuristic baseline: +4.74; random: +2.16). Code, env, training notebooks, and a live HF Space are all linked from the [`README`](README.md).
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Why wildfires?
|
| 10 |
-
|
| 11 |
-
Most RL environments for language models are puzzles, games, or code tasks. We wanted something with three properties at once:
|
| 12 |
-
|
| 13 |
-
1. **Long-horizon, sparse-terminal reward.** A real plan has to survive 100+ steps before the result lands.
|
| 14 |
-
2. **Partial observability that *gets worse* during the episode.** Smoke spreads, recon expires, fog-of-war hides what hasn't been scouted recently.
|
| 15 |
-
3. **An explicit instruction-following channel.** A first-step "operational briefing" the agent must read, internalize, and adhere to — and a reward term that rewards adherence.
|
| 16 |
-
|
| 17 |
-
Wildfire incident command hits all three. An incident commander gets a briefing, has hard resource limits (crews, air tankers, firebreak budget, recon), and has to balance speed vs. coverage vs. civilian safety while wind, slope, and humidity all change underneath them. We turned that into a structured grid environment with typed actions, an `OperationalBriefing` on reset, and a decomposed reward — and then trained an LLM to play the role of the IC.
|
| 18 |
|
| 19 |
---
|
| 20 |
|
| 21 |
-
## The
|
| 22 |
-
|
| 23 |
-
The environment is OpenEnv-compliant: `reset(task_id, seed) → Observation`, `step(Action) → StepResult`, `state() → dict`. Three difficulty tiers, all runnable on the same code path:
|
| 24 |
|
| 25 |
-
|
| 26 |
-
Easy → 15×15 flat grid, 1 ignition, constant wind, 80 steps
|
| 27 |
-
Medium → 25×25 canyon terrain, 2 ignitions, wind shifts, smoke, 150 steps
|
| 28 |
-
Hard → 40×40 wildland-urban interface, staggered ignitions,
|
| 29 |
-
fog-of-war, mid-episode crew casualty, 300 steps
|
| 30 |
-
```
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
-
validate(action) → execute(action) → spread_fire → apply_suppression
|
| 36 |
-
→ evolve_weather → update_moisture → propagate_smoke → tick_cooldowns
|
| 37 |
-
→ expire_recon → trigger_scripted_events → compute_reward → check_termination
|
| 38 |
-
```
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
-
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
|
| 44 |
-
× (1 − moisture) × (1 − suppression) × tier_scale
|
| 45 |
-
```
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
-
##
|
| 52 |
|
| 53 |
-
|
| 54 |
|
| 55 |
-
|
| 56 |
-
- BFS-clusters fire cells into bounding boxes ("3 BURNING clusters near rows 7–12, cols 3–8") so prompt length is `O(regions)` not `O(cells)`.
|
| 57 |
-
- Lists resource state with cooldown warnings.
|
| 58 |
-
- Surfaces the last 5 notable events.
|
| 59 |
-
- Notes weather noise levels explicitly so the model knows readings are not exact.
|
| 60 |
|
| 61 |
-
**
|
| 62 |
-
1. Strip code fences, find a JSON object, parse it directly.
|
| 63 |
-
2. If JSON parsing fails: regex-extract `action_type` and per-action fields.
|
| 64 |
-
3. Final fallback: return a safe `IDLE`. The training loop never breaks on bad model output.
|
| 65 |
-
|
| 66 |
-
That parser fallback is also a **defense against reward hacking** — there's no clever output that crashes the env or skips a step. Worst case the model burns a step on `IDLE` and pays the small step penalty.
|
| 67 |
|
| 68 |
---
|
| 69 |
|
| 70 |
-
## The
|
| 71 |
-
|
| 72 |
-
GRPO computes advantages by comparing rollout rewards within a group of completions for the same prompt. If your reward signal is too narrow (e.g. all rewards in `[0, 1]`), the advantages collapse and the gradient washes out. We deliberately built a wide-range, decomposed reward.
|
| 73 |
|
| 74 |
-
|
| 75 |
-
```
|
| 76 |
-
step_reward = 0.4·Δcontainment + 0.4·Δpop_safety − 0.1·redundant_action
|
| 77 |
-
```
|
| 78 |
|
| 79 |
-
|
| 80 |
-
```
|
| 81 |
-
+5.0 if zero population lost
|
| 82 |
-
+0–2.0 efficiency bonus (faster containment ⇒ more)
|
| 83 |
-
+1.0 briefing-adherence bonus (all priority zones survived)
|
| 84 |
-
−3.0 · (pop_lost / total_pop) if any population lost
|
| 85 |
-
−2.0 if any crew casualty
|
| 86 |
-
−0.01 × invalid_action_count capped at −0.2
|
| 87 |
-
```
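As a rough illustration of how the terminal terms listed above might combine, here is a minimal Python sketch. The function name `terminal_reward`, its parameters, and the linear shape of the efficiency bonus are assumptions made for illustration. Only the coefficients come from the list above.

```python
def terminal_reward(pop_lost, total_pop, steps_used, max_steps,
                    priority_zones_survived, crew_casualty, invalid_actions):
    """Hypothetical combination of the terminal reward terms."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                              # zero population lost
    else:
        reward -= 3.0 * (pop_lost / total_pop)     # scaled population penalty
    # Assumed: efficiency bonus falls linearly from 2.0 to 0.0 as steps are used.
    reward += 2.0 * max(0.0, 1.0 - steps_used / max_steps)
    if priority_zones_survived:
        reward += 1.0                              # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    # Invalid-action penalty of 0.01 each, capped at a total of 0.2.
    reward -= min(0.01 * invalid_actions, 0.2)
    return reward
```

Under these assumptions, a clean episode that finishes halfway through its step budget scores 5.0 + 1.0 + 1.0, while an episode that loses half the population and a crew lands well below zero, which is the kind of spread the group-relative comparison can work with.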
|
| 88 |
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
**Two reward functions, not one.** For GRPO we register two reward functions with TRL:
|
| 92 |
-
- `reward_fn_outcome` — the full episodic reward described above (computed by *running the full episode*, see "What broke" below).
|
| 93 |
-
- `reward_fn_format` — a tiny standalone JSON-format check (`+0.15` for valid JSON with a recognized `action_type`, `0.0` for valid JSON with an unknown type, `−0.20` for unparseable garbage). This rewards good formatting independently from policy quality.
|
| 94 |
-
|
| 95 |
-
This is the "multiple independent reward functions" pattern from the OpenEnv hackathon guide — and it cost us about 30 lines of code.
|
| 96 |
|
| 97 |
---
|
| 98 |
|
| 99 |
-
##
|
| 100 |
-
|
| 101 |
-
### Stage 1 — SFT warm-up (~30 min)
|
| 102 |
-
|
| 103 |
-
We harvested 4,300 `(prompt, action_json)` pairs from `HeuristicAgent` rollouts on successful episodes (filtered to `population_lost == 0`):
|
| 104 |
-
|
| 105 |
-
| Tier | Examples |
|
| 106 |
-
|---|---|
|
| 107 |
-
| Easy | 2,000 |
|
| 108 |
-
| Medium | 1,500 |
|
| 109 |
-
| Hard | 800 |
|
| 110 |
|
| 111 |
-
|
| 112 |
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
Starting from the SFT adapter, we run TRL's `GRPOTrainer` with 8 generations per prompt, `learning_rate=3e-6`, `max_completion_length=192`. The reward function is the key piece:
|
| 116 |
-
|
| 117 |
-
```python
|
| 118 |
-
def reward_fn_outcome(completions, prompts, tier=None, seed=None, **kwargs):
|
| 119 |
-
    rewards = []
|
| 120 |
-
    for i, completion in enumerate(completions):
|
| 121 |
-
        env = WildfireEnv()
|
| 122 |
-
        # CRUCIAL: replay the EXACT (tier, seed) that produced this prompt
|
| 123 |
-
        obs = env.reset(task_id=tier[i], seed=seed[i])
|
| 124 |
-
        action, _ = parse_action(completion, obs)
|
| 125 |
-
        result = env.step(action)
|
| 126 |
-
        total = result.reward
|
| 127 |
-
        # Heuristic carries the episode to completion so terminal reward fires
|
| 128 |
-
        heuristic = HeuristicAgent()
|
| 129 |
-
        while not env.done:
|
| 130 |
-
            result = env.step(heuristic.act(env._current_obs))
|
| 131 |
-
            total += result.reward
|
| 132 |
-
        rewards.append(total)
|
| 133 |
-
    return rewards
|
| 134 |
-
```
|
| 135 |
-
|
| 136 |
-
The `CurriculumController` watches a rolling 10-batch reward average and promotes the dataset from easy → medium → hard. A `TrainerCallback` rebuilds the prompt dataset whenever a promotion fires, so prompts and reward states stay synchronized.
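A minimal sketch of the promotion logic described above, assuming a simple threshold rule. The class body, the `record` method, and the threshold values are hypothetical; the source only states that a rolling 10-batch reward average drives promotion from easy to medium to hard.

```python
from collections import deque


class CurriculumController:
    """Hypothetical sketch: promote when the rolling mean reward clears a threshold."""

    TIERS = ("easy", "medium", "hard")

    def __init__(self, thresholds=(6.0, 5.0), window=10):
        self.thresholds = thresholds          # assumed values, one per promotion
        self.window = deque(maxlen=window)    # rolling buffer of batch mean rewards
        self.tier_index = 0

    @property
    def tier(self):
        return self.TIERS[self.tier_index]

    def record(self, batch_mean_reward):
        """Record one batch; return True if a promotion fired."""
        self.window.append(batch_mean_reward)
        at_top = self.tier_index == len(self.TIERS) - 1
        if at_top or len(self.window) < self.window.maxlen:
            return False
        if sum(self.window) / len(self.window) >= self.thresholds[self.tier_index]:
            self.tier_index += 1
            self.window.clear()               # start a fresh average on the new tier
            return True
        return False
```

The boolean returned by `record` is the kind of signal a callback could use to decide when to rebuild the prompt dataset for the new tier.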
|
| 137 |
|
| 138 |
---
|
| 139 |
|
| 140 |
-
##
|
| 141 |
-
|
| 142 |
-
We're including this section because we think the bugs are more interesting than the headline numbers.
|
| 143 |
-
|
| 144 |
-
### v1 GRPO bug #1 — Frozen dataset, live curriculum
|
| 145 |
|
| 146 |
-
|
| 147 |
|
| 148 |
-
|
| 149 |
|
| 150 |
-
|
| 151 |
|
| 152 |
-
The
|
| 153 |
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
### v1 GRPO bug #3 — Prompt/reward state mismatch
|
| 157 |
-
|
| 158 |
-
The most insidious bug. The dataset's prompts were generated from `(tier, seed=fresh_random)`. The reward function then **picked a different random seed** to roll out against. So the model was being scored in a completely different env state than the one shown in its prompt. Imagine being asked "what would you do here?" while shown a photo of New York, and graded on what would have happened in Tokyo.
|
| 159 |
-
|
| 160 |
-
**Fix:** every dataset row stores its `seed`. The reward function reads `seed` from `kwargs` (TRL passes dataset columns through as kwargs) and resets the env to that exact `(tier, seed)`. Prompt state and reward state are now identical.
|
| 161 |
-
|
| 162 |
-
### v1 GRPO bug #4 — Wasted inner generations
|
| 163 |
-
|
| 164 |
-
The v1 reward function called `model.generate()` *seven extra times per completion* to build a multi-step rollout. But GRPO gradients only flow through the originally sampled completion — those 7 extra generations were expensive noise.
|
| 165 |
-
|
| 166 |
-
**Fix:** `MODEL_STEPS = 1`. The model's sampled completion is applied as the step-0 action; the heuristic carries the rest. The wall-clock per training step dropped by ~70%.
|
| 167 |
-
|
| 168 |
-
### v1 GRPO bug #5 — Crash on format-only reward
|
| 169 |
-
|
| 170 |
-
We tried to add a format-validity reward early on, but `parse_action(text, obs)` reads `obs.grid` to validate spatial fields. Calling it with `obs=None` for a pure format check crashed.
|
| 171 |
-
|
| 172 |
-
**Fix:** a standalone `check_json_format(text)` function that doesn't need an obs. Three-state output (`json_success / regex_fallback / safe_idle`) → reward `(+0.15 / 0 / −0.20)`.
|
| 173 |
-
|
| 174 |
-
We're being open about these bugs because we think *the post-mortem matters more than the leaderboard.* Anyone training GRPO on a custom OpenEnv environment is likely to hit at least three of these five.
|
| 175 |
|
| 176 |
---
|
| 177 |
|
| 178 |
-
##
|
| 179 |
|
| 180 |
-
|
| 181 |
|
| 182 |
-
| Agent | Easy | Medium | Hard |
|
| 183 |
-
|---|---|---|---|
|
| 184 |
-
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
|
| 185 |
-
| Heuristic | +7.53 ± 0.08 | +6.31 ± 2.77 | +4.74 ± 3.79 |
|
| 186 |
-
| **Trained Qwen-2.5-7B (ours)** | **{TBD}** | **{TBD}** | **{TBD}** |
|
| 187 |
|
| 188 |
-
|
| 189 |
|
| 190 |
-
|
| 191 |
|
| 192 |
-
The
|
| 193 |
-
|
| 194 |
-
### What the trained agent learned (qualitatively)
|
| 195 |
-
|
| 196 |
-
*Filled in post-run from inspection of `training/samples/call_*.txt`:*
|
| 197 |
-
|
| 198 |
-
- {TBD: behavioral pattern 1, e.g. "tends to drop retardant ahead of wind direction rather than reactively"}
|
| 199 |
-
- {TBD: behavioral pattern 2, e.g. "deploys crews around priority zones first, even when fire is closer to lower-priority cells"}
|
| 200 |
-
- {TBD: behavioral pattern 3, e.g. "saves recon for mid-episode after staggered ignition fires"}
|
| 201 |
-
|
| 202 |
-
---
|
| 203 |
-
|
| 204 |
-
## Key learnings
|
| 205 |
-
|
| 206 |
-
1. **Reward decomposition matters more than model size.** A wide, structured reward gave a 7B model enough signal to surpass random and approach the heuristic on medium. We expect a 1.5B model would also work — the bottleneck is reward design, not parameters.
|
| 207 |
-
2. **Curriculum is essential for long-horizon tasks.** Throwing hard tier directly at the SFT model produced near-zero gradient signal — the +5 terminal bonus was almost never observed. Easy → medium → hard with auto-promotion was the difference.
|
| 208 |
-
3. **Format compliance must be a first-class reward, not an afterthought.** The format-only reward function (`+0.15 / 0 / −0.20`) cost us 30 lines and meaningfully reduced parse-failure rate during training. It also makes the JSON success rate trackable as an independent metric.
|
| 209 |
-
4. **Replay the prompt's exact env state when scoring completions.** Stochastic env resets in your reward function turn GRPO into "what's a good action *somewhere*?" instead of "what's a good action *here*?". The latter is what you actually want.
|
| 210 |
-
5. **Heuristic continuation is a powerful variance-reduction trick.** Letting the heuristic finish each rollout reduces noise from the model's later (uncertain) actions, so the gradient signal mostly reflects the *first* action's quality. Combined with full-episode rollout, you get terminal reward without 300 `model.generate()` calls per training step.
|
| 211 |
-
6. **Inspect generations on disk every N steps.** TRL's stdout logging shows you `mean_reward` only. Saving the first completion of each batch to `training/samples/call_{n}.txt` is what catches reward hacking and format regressions before they become catastrophic.
|
| 212 |
|
| 213 |
---
|
| 214 |
|
| 215 |
-
##
|
| 216 |
-
|
| 217 |
-
- **Heuristic continuation is a double-edged sword.** It reduces variance, but the reward attributes a good outcome to the model's first action even when the heuristic deserves most of the credit. A planned ablation: train one model with heuristic continuation and one with full-model rollout, compare on held-out seeds.
|
| 218 |
-
- **Hard tier still has high variance.** Heuristic std on hard is ±3.79 — bimodal between full saves and total losses. Smoothing the ignition-spawn distribution (`_find_ignition_candidate` in `wildfire_env.py`) would reduce this.
|
| 219 |
-
- **Single-tenant FastAPI server.** The HF Space currently uses a module-level `_env` singleton. Two concurrent users would clobber each other's episode. Per-session env binding via cookie/header is a 30-line fix we deferred.
|
| 220 |
-
- **Held-out generalization untested at scale.** We evaluate on seeds 200–214 (15 per tier) which don't appear in the 0–99 training pool. A larger holdout (say 200–999) would tighten the confidence intervals.
|
| 221 |
-
- **No multi-agent coordination experiments yet.** Each crew already runs a local autonomous policy; an obvious next step is to also let multiple LLM ICs collaborate on a shared incident.
|
| 222 |
-
|
| 223 |
-
---
|
| 224 |
|
| 225 |
-
|
| 226 |
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
|
| 232 |
|
| 233 |
---
|
| 234 |
|
| 235 |
## Links
|
| 236 |
|
| 237 |
-
-
|
| 238 |
-
-
|
| 239 |
-
-
|
| 240 |
-
-
|
| 241 |
-
-
|
| 242 |
-
-
|
| 243 |
-
-
|
| 244 |
-
- 📄 Top-level overview: [`README.md`](README.md)
|
| 245 |
|
| 246 |
-
*
|
| 1 |
+
# The Long Walk to an Incident Commander
|
| 2 |
|
| 3 |
+
*How a stray clip about the LA fires turned into my first reinforcement learning project.*
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## The empty shortlist
|
| 8 |
|
| 9 |
+
When the OpenEnv hackathon was announced, I did not have an idea.
|
| 10 |
|
| 11 |
+
I had a shortlist, sure, the kind every hobbyist machine-learning person carries around in the back of their head. A code-review agent. A structured-output LLM judge. Something to do with compliance reports, an idea I abandoned within a day because the moment I tried to write a one-paragraph pitch I started yawning. There is a useful diagnostic in that, I think. If your own pitch bores you in draft, it will bore everybody else louder.
|
| 12 |
|
| 13 |
+
So I scrolled. Embarrassingly enough, that is where the project really starts.
|
| 14 |
|
| 15 |
+
It was around one in the morning, and I was doom-scrolling YouTube Shorts in the kind of fugue where individual videos stop registering and become a slurry of noise. An NBC Bay Area clip drifted past. Then aerial footage of the LA fires, hills the colour of rust, voice-overs droning numbers about acres and containment percentages. What I remember is the way one reporter said, almost in passing, *"the incident commander has decided to pull crews back"*. A single human being, deciding, at three in the morning their time, where to put the people, where to drop the retardant, which neighbourhoods to write off. I closed the app, went to bed, and didn't think about it again for a couple of days.
|
| 16 |
|
| 17 |
+
But it kept coming back. Quietly, at first, while I was supposed to be reading the hackathon brief. The themes mapped almost too cleanly onto what that incident commander in the clip was doing: incomplete information, hard resource limits, weather you cannot argue with, civilians you cannot disappoint. I scribbled *"Wildfire Incident Commander as an RL task"* on a sticky note and pressed it to the side of my monitor. The note is still there, slightly buckled at the corner now, with coffee on it.
|
| 18 |
|
| 19 |
+
There was one problem with this plan. I had never actually trained a reinforcement learning policy.
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
+
## How I got into RL in the first place
|
| 24 |
|
| 25 |
+
My exposure to RL up to this point was almost entirely cinematic.
|
| 26 |
|
| 27 |
+
I had grown up on those grainy DeepMind Atari videos, the ones where a tiny green paddle slowly figures out it can tunnel through the side of a Breakout wall and bounce the ball around behind it. I remember rewatching that clip on loop the first time I saw it and feeling something genuinely uncanny. The agent had not been told about the tunnel. Nobody coded the tunnel. It just appeared, somewhere in the loss landscape, as the cheapest way to keep the ball alive.
|
| 28 |
|
| 29 |
+
From there it was the usual pipeline. The *AlphaGo* documentary, late one weekend. OpenAI's hide-and-seek video where the agents start surfing on boxes their opponents are still trying to lock down. Two Minute Papers explaining AlphaStar and OpenAI Five with that delighted Hungarian cadence. I read Sutton and Barto in the way most people read Sutton and Barto, which is to say, three chapters in great detail and the rest in spirit. I read the Mnih DQN paper, the Schulman PPO paper, eventually the DeepSeek-R1 work and the GRPO derivations, and I poked at a couple of CartPole notebooks. But I had never actually trained a policy that mattered. RL had this folkloric reputation around it: finicky, expensive, vibes-based, the part of the deep-learning toolbox most likely to silently fail in interesting ways. I had a healthy fear of the field.
|
| 30 |
|
| 31 |
---
|
| 32 |
|
| 33 |
+
## The confession
|
| 34 |
|
| 35 |
+
So the confession I should make early is that this was my first real reinforcement learning project. OpenEnv was even newer to me. I came in cold.
|
| 36 |
|
| 37 |
+
What kept me from bailing in the first three days was, paradoxically, exactly that newness. I was already accepting I would be uncomfortable. I figured I might as well be uncomfortable about something that genuinely interested me. The Dunning-Kruger trough was waiting for me regardless. Better to fall into it doing something I would be proud to talk about afterwards.
|
| 38 |
|
| 39 |
+
So I started reading.
|
| 40 |
|
| 41 |
---
|
| 42 |
|
| 43 |
+
## The Rothermel rabbit hole
|
| 44 |
|
| 45 |
+
For the unfamiliar: Richard Rothermel published the canonical surface-fire spread model in 1972, in a US Forest Service research paper titled, with monastic plainness, *"A Mathematical Model for Predicting Fire Spread in Wildland Fuels"*. It is roughly forty pages, and it underwrites almost every operational fire-behaviour predictor used in the field since. BehavePlus, FARSITE, FlamMap: different tools, different abstractions, the same skeleton. I downloaded the PDF at four in the afternoon and was still reading it at midnight. I was unprepared for how much *taste* there was in those equations, the way Rothermel had to balance theoretical fidelity against parameters a real-world ranger could plausibly measure with a pole and a moisture probe. There is a particular kind of engineering elegance in a model that survives that long under field conditions.
|
| 46 |
|
| 47 |
+
What surprised me more, and what tipped this project from "interesting" to "I have to do this", was how thin the work at the *intersection* of language models, reinforcement learning, and wildfire response was. There are RL papers on fire suppression (Subramanian and Crowley's forest-fire DRL work, Julian and Kochenderfer on aircraft routing for wildfire surveillance, the FireCommander multi-UAV environment). There are LLM-as-agent papers on disaster response. There is an entire operations-research literature on resource allocation in incident command. But the middle of that Venn diagram, where an LLM is the actor inside an RL loop on a Rothermel-style spread environment, was almost empty. For a solo entrant that is a gift. Judges will see ten polished agentic-coding projects for every one project that even *attempts* something this far off the beaten path. I would rather be the rough sketch of something unusual than the tenth-best version of something normal.
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
+
## The metamorphosis
|
| 52 |
|
| 53 |
+
My first design was nothing like what I ended up shipping.
|
| 54 |
|
| 55 |
+
The initial prototype was a single-step decision agent. The model would receive a snapshot of the fire (a textual map, a brief weather summary, a list of crews) and produce *one* action. The environment would run a thirty-step simulation under a fixed policy, then report back a number. I would train against that number. Clean, simple, tractable. I built it out over two evenings.
|
| 56 |
|
| 57 |
+
It was bad. Not in the "the loss is high" way, but in the "this isn't actually the problem I want to solve" way. A single decision under a thirty-step rollout teaches the model almost nothing about *sequencing*. It teaches it to be a one-shot triagist, a useful skill in narrow contexts, but not what an incident commander actually does. An IC's job is to keep deciding as the situation degrades, to revise, to recover when a crew gets hurt and the wind shifts and the second ignition happens behind them. None of that was in v1.
|
| 58 |
|
| 59 |
+
The metamorphosis happened gradually. I added a step penalty so the model could not loiter. I added a terminal reward for population saved, and immediately the gradient washed out because the terminal reward was rare and small relative to per-step noise. I scaled the terminal reward up; the model learnt to game the per-step component instead. I added a *briefing*, a written paragraph describing priority zones, infrastructure, and the wind forecast, partly because real ICs read briefings, and partly because it gave me an honest measurement of instruction following. I added a curriculum because the hard tier, on its own, produced a flat reward curve and a sad-looking W&B chart. I added a second reward function for JSON validity because GRPO begs for it. I added a heuristic continuation step because I was burning GPU minutes making the model generate seven extra times per prompt, when the gradient only flowed through the *first* generation anyway.
|
| 60 |
|
| 61 |
+
Each of those decisions came from running the thing and watching it fail in a specific, legible way. The system I ended up with is not the system I designed. It is the system that survived.
|
| 62 |
|
| 63 |
---
|
| 64 |
|
| 65 |
+
## What I want the judges to take away
|
| 66 |
|
| 67 |
+
There were nights, there are always nights, where the project felt absurd. A solo participant. First RL project. A custom environment. A custom reward decomposition. SFT, then GRPO, on a 7B model, on Colab, with a curriculum controller and a callback that rebuilds the dataset mid-run. I made all the canonical mistakes, including, briefly, training on a dataset whose seeds did not match the seeds the reward function rolled out against, which is the GRPO equivalent of grading a student's geography exam by asking them about a different country. I found it the way you always find these things, by squinting at a sample completion at 2 a.m. and thinking *wait, that does not match*.
|
| 68 |
|
| 69 |
+
But I learned more in three weeks than I would have from a semester of well-mannered tutorials. Reinforcement learning, as a sub-field, is unforgiving in a productive way. It does not let you confuse "the loss is going down" with "the model is doing the thing". You have to look. You have to read rollouts. You have to talk to your model the way an IC talks to a crew chief, patiently, specifically, willing to be told the plan is wrong.
|
| 70 |
|
| 71 |
+
If a judge reads this far, I want to be candid about what I think the contribution actually is. The trained model does not beat the heuristic on hard tier. It approaches it on medium (+5.74 vs. +6.31) and falls short on hard, and I would rather submit honest numbers than goose them. The headline is not the leaderboard; it is the *artifact*: a typed, OpenEnv-compliant environment with a Rothermel-flavoured spread model, a decomposed reward built for GRPO, a serialiser that keeps prompt length sub-linear in grid size, a parser that refuses to crash on malformed completions, and an end-to-end training recipe somebody else can pick up tomorrow and improve on. Plus a frank post-mortem of every bug I hit along the way, which is, I suspect, more useful to the next person trying this than another tenth of a reward point would be.
|
| 72 |
|
| 73 |
+
I started this project because of a one-minute clip about somebody else's bad night. I am finishing it knowing more about reinforcement learning, more about wildland fire science, and a little more about my own tolerance for ambiguity than I did three weeks ago.
|
| 74 |
|
| 75 |
+
The sticky note is staying on the monitor.
|
| 76 |
|
| 77 |
---
|
| 78 |
|
| 79 |
+
## The five bugs that taught me the most
|
| 80 |
|
| 81 |
+
A short post-mortem, because the bugs are the part of the story you actually learn from.
|
| 82 |
|
| 83 |
+
1. **Frozen dataset, live curriculum.** The controller promoted to *medium* at step 10 and *hard* at step 20, but the prompt dataset was built once before training and never refreshed. The model was happily being scored on easy prompts while the dashboard insisted it was on hard. Fixed with a `TrainerCallback` that rebuilds the dataset on tier change.
|
| 84 |
+
2. **Truncated rollouts never saw terminal reward.** v1 ran a fixed 15-step rollout per completion. Hard tier needs at least 80 steps before the +5 survival bonus can fire, so GRPO was optimising against per-step deltas only. v2 runs to `env.done`. It is twice as slow, but the gradient signal is night-and-day better.
|
| 85 |
+
3. **Prompt and reward state mismatch.** Each dataset row was generated with a fresh random seed, and the reward function picked *another* fresh seed at scoring time. The model was being graded on a different fire than the one in its prompt. Now every row carries its `seed`, and the reward function resets to that exact `(tier, seed)`.
|
| 86 |
+
4. **Wasted inner generations.** v1 called `model.generate()` seven extra times per completion to build a multi-step rollout, but GRPO gradients only flow through the originally sampled completion. Those seven calls were expensive noise. Cutting `MODEL_STEPS` to 1 and letting the heuristic finish the episode dropped wall-clock per step by about 70%.
|
| 87 |
+
5. **Format reward crashing on `obs=None`.** The action parser reads `obs.grid` to validate spatial fields, so calling it for a pure JSON-validity check crashed. A standalone `check_json_format()` that does not need an obs solved it in twenty lines.
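For bug 5, a minimal sketch of what an observation-free format check could look like. The action vocabulary here is an assumption (only `IDLE` is named in the write-up), and the three scores (+0.15, 0.0, −0.20) follow the format reward described in the earlier draft of this post; the real `check_json_format()` may differ.

```python
import json
import re

# Assumed action vocabulary for illustration; only IDLE appears in the write-up.
KNOWN_ACTION_TYPES = {"IDLE", "DEPLOY_CREW", "DROP_RETARDANT"}


def check_json_format(text):
    """Score formatting only: +0.15, 0.0, or -0.20. Needs no observation."""
    # Drop markdown code-fence markers if the model wrapped its answer in them.
    cleaned = re.sub(r"`{3}(?:json)?", "", text).strip()
    # Take the outermost {...} span, if there is one.
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match is None:
        return -0.20
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return -0.20
    action_type = payload.get("action_type")
    if isinstance(action_type, str) and action_type in KNOWN_ACTION_TYPES:
        return 0.15   # valid JSON with a recognised action type
    return 0.0        # valid JSON, but the action type is missing or unknown
```

Because the function never looks at a grid, it can be registered as its own reward function and tracked as a separate metric from policy quality.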
|
| 88 |
|
| 89 |
---
|
| 90 |
|
| 91 |
## Links
|
| 92 |
|
| 93 |
+
- Live environment on Hugging Face: [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator)
|
| 94 |
+
- Source: [`Abrodolph/Wildfire-Containment-Simulator`](https://github.com/Abrodolph/Wildfire-Containment-Simulator)
|
| 95 |
+
- Trained model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b)
|
| 96 |
+
- W&B run: [`saini-eshit-/wildfire-grpo/runs/dnz56kuu`](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu)
|
| 97 |
+
- GRPO notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb)
|
| 98 |
+
- SFT notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb)
|
| 99 |
+
- Top-level overview: [`README.md`](README.md)
|
| 100 |
|
| 101 |
+
*Eshit, April 2026*
|