Update README
README.md
CHANGED
|
@@ -1,361 +1,364 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: Wildfire Containment Simulator
|
| 3 |
-
emoji: 🔥
|
| 4 |
-
colorFrom: red
|
| 5 |
-
colorTo: purple
|
| 6 |
-
sdk: docker
|
| 7 |
-
pinned: false
|
| 8 |
-
license: mit
|
| 9 |
-
tags:
|
| 10 |
-
- reinforcement-learning
|
| 11 |
-
- simulation
|
| 12 |
-
- openenv
|
| 13 |
-
- wildfire
|
| 14 |
-
- rl-environment
|
| 15 |
-
- long-horizon
|
| 16 |
-
- instruction-following
|
| 17 |
-
---
|
| 18 |
-
|
| 19 |
-
# Wildfire Containment Simulator
|
| 20 |
-
|
| 21 |
-
**Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**
|
| 22 |
-
|
| 23 |
-

|
| 24 |
-

|
| 25 |
-

|
| 26 |
-

|
| 27 |
-

|
| 28 |
-
|
| 29 |
-
A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
|
| 30 |
-
|
| 31 |
-
> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **
|
| 32 |
-
> *(
|
| 33 |
-
|
| 34 |
-
---
|
| 35 |
-
|
| 36 |
-
## Quick Links
|
| 37 |
-
|
| 38 |
-
| Resource | Link |
|
| 39 |
-
|---|---|
|
| 40 |
-
| **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
|
| 41 |
-
| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
|
| 42 |
-
| **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
|
| 43 |
-
| **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
|
| 44 |
-
| **Long-form blog post** | [`BLOG.md`](BLOG.md) |
|
| 45 |
-
| **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
|
| 46 |
-
| **Training dashboard** | [
|
| 47 |
-
| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
|
| 48 |
-
| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
|
| 49 |
-
|
| 50 |
-
---
|
| 51 |
-
|
| 52 |
-
## Why Theme 2
|
| 53 |
-
|
| 54 |
-
| Pillar | How we model it |
|
| 55 |
-
|---|---|
|
| 56 |
-
| **Long-horizon planning** | The Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
|
| 57 |
-
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
|
| 58 |
-
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
|
| 59 |
-
|
| 60 |
-
---
|
| 61 |
-
|
| 62 |
-
## Real-World Motivation
|
| 63 |
-
|
| 64 |
-
Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so that an LLM can be trained, evaluated, and inspected on it end-to-end.
|
| 65 |
-
|
| 66 |
-
For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
|
| 67 |
-
|
| 68 |
-
---
|
| 69 |
-
|
| 70 |
-
## Quickstart
|
| 71 |
-
|
| 72 |
-
```bash
|
| 73 |
-
# Clone and install
|
| 74 |
-
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
|
| 75 |
-
cd Wildfire-Containment-Simulator
|
| 76 |
-
uv pip install -r requirements.txt
|
| 77 |
-
uv pip install -e .
|
| 78 |
-
|
| 79 |
-
# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
|
| 80 |
-
python scripts/evaluate.py 5
|
| 81 |
-
|
| 82 |
-
# Compare agents head-to-head
|
| 83 |
-
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
|
| 84 |
-
--tiers easy medium hard --agents random heuristic
|
| 85 |
-
|
| 86 |
-
# Render an episode as a GIF
|
| 87 |
-
python scripts/replay.py --tier medium --seed 42 \
|
| 88 |
-
--agent heuristic --output demos/replay.gif
|
| 89 |
-
|
| 90 |
-
# Spin up the OpenEnv FastAPI server locally on port 7860
|
| 91 |
-
python server/app.py
|
| 92 |
-
# Then visit http://localhost:7860/ui/ for the interactive frontend
|
| 93 |
-
```
|
| 94 |
-
|
| 95 |
-
Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
|
| 96 |
-
|
| 97 |
-
---
|
| 98 |
-
|
| 99 |
-
## Live Hugging Face Space
|
| 100 |
-
|
| 101 |
-
The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
|
| 102 |
-
|
| 103 |
-
```bash
|
| 104 |
-
SPACE=https://eshit-wildfire-containment-simulator.hf.space
|
| 105 |
-
|
| 106 |
-
curl "$SPACE/health"
|
| 107 |
-
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
|
| 108 |
-
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
|
| 109 |
-
-d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
|
| 110 |
-
```
|
| 111 |
-
|
| 112 |
-
Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
|
| 113 |
-
|
| 114 |
-
---
|
| 115 |
-
|
| 116 |
-
## Environment API
|
| 117 |
-
|
| 118 |
-
```python
|
| 119 |
-
from env import WildfireEnv, Action, ActionType, Direction
|
| 120 |
-
|
| 121 |
-
env = WildfireEnv()
|
| 122 |
-
obs = env.reset(task_id="easy", seed=42) # Observation (with OperationalBriefing on first step)
|
| 123 |
-
|
| 124 |
-
while not env.done:
|
| 125 |
-
    action = Action(
|
| 126 |
-
        action_type=ActionType.DEPLOY_CREW,
|
| 127 |
-
        crew_id="crew_0",
|
| 128 |
-
        target_row=7, target_col=7,
|
| 129 |
-
    )
|
| 130 |
-
    result = env.step(action)  # StepResult
|
| 131 |
-
    obs = result.observation
|
| 132 |
-
    reward = result.reward  # decomposed float, range ~-8 to +8
|
| 133 |
-
    done = result.done
|
| 134 |
-
|
| 135 |
-
state = env.state() # Full ground truth (grading only)
|
| 136 |
-
```
|
| 137 |
-
|
| 138 |
-
`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders β agents must work from `Observation`.
|
| 139 |
-
|
| 140 |
-
---
|
| 141 |
-
|
| 142 |
-
## Action Space
|
| 143 |
-
|
| 144 |
-
All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
|
| 145 |
-
|
| 146 |
-
| Action | Required parameters | Description |
|
| 147 |
-
|---|---|---|
|
| 148 |
-
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
|
| 149 |
-
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
|
| 150 |
-
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
|
| 151 |
-
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
|
| 152 |
-
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
|
| 153 |
-
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
|
| 154 |
-
| `idle` | `reason` *(optional)* | Explicitly wait |
|
| 155 |
-
|
| 156 |
-
A 3-layer parser (`env/action_parser.py`) maps raw LLM output to a structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
|
| 157 |
-
|
| 158 |
-
---
|
| 159 |
-
|
| 160 |
-
## Observation Space
|
| 161 |
-
|
| 162 |
-
| Component | Contents | Noise / occlusion |
|
| 163 |
-
|---|---|---|
|
| 164 |
-
| `briefing` | `OperationalBriefing` on the first observation: incident ID, priority zones, infrastructure, wind forecast | First step only |
|
| 165 |
-
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
|
| 166 |
-
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
|
| 167 |
-
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
|
| 168 |
-
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
|
| 169 |
-
| `recent_events` | Last 5 notable events | Fully observable |
|
| 170 |
-
|
| 171 |
-
The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
|
| 172 |
-
|
| 173 |
-
---
|
| 174 |
-
|
| 175 |
-
## Reward Function
|
| 176 |
-
|
| 177 |
-
Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.
|
| 178 |
-
|
| 179 |
-
**Per-step (dense):**
|
| 180 |
-
```
|
| 181 |
-
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
|
| 182 |
-
```
|
| 183 |
-
|
| 184 |
-
**Terminal (sparse, on episode end):**
|
| 185 |
-
```
|
| 186 |
-
+5.0 if all populations safe
|
| 187 |
-
+0–2.0 efficiency bonus (faster containment → more)
|
| 188 |
-
+1.0 briefing-adherence bonus (all priority zones survived)
|
| 189 |
-
−3.0 · (pop_lost / total_pop) if any population lost
|
| 190 |
-
−2.0 if any crew casualty
|
| 191 |
-
−0.01 × invalid_action_count capped at −0.2
|
| 192 |
-
```
|
| 193 |
-
|
| 194 |
-
Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
|
| 195 |
-
|
| 196 |
-
| Tier | Spread scale | Episode length | Approx. reward ceiling |
|
| 197 |
-
|---|---|---|---|
|
| 198 |
-
| Easy | 1.00× | 80 | +8 |
|
| 199 |
-
| Medium | 0.70× | 150 | +7 |
|
| 200 |
-
| Hard | 0.55× | 300 | +6 |
|
| 201 |
-
|
| 202 |
-
---
|
| 203 |
-
|
| 204 |
-
## Three Difficulty Tiers
|
| 205 |
-
|
| 206 |
-
### Task 1 (Easy): Flatland Grass Fire
|
| 207 |
-
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
|
| 208 |
-
|
| 209 |
-
### Task 2 (Medium): Canyon Terrain with Wind Shifts
|
| 210 |
-
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
|
| 211 |
-
|
| 212 |
-
### Task 3 (Hard): Wildland-Urban Interface Crisis
|
| 213 |
-
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
|
| 214 |
-
|
| 215 |
-
---
|
| 216 |
-
|
| 217 |
-
## Fire Spread Model
|
| 218 |
-
|
| 219 |
-
A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
|
| 220 |
-
|
| 221 |
-
```
|
| 222 |
-
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
|
| 223 |
-
            × (1 − moisture) × (1 − suppression) × tier_scale
|
| 224 |
-
```
|
| 225 |
-
|
| 226 |
-
| Factor | Effect |
|
| 227 |
-
|---|---|
|
| 228 |
-
| `base_rate` | Baseline spread by fuel type |
|
| 229 |
-
| `fuel_factor` | Fuel load of the target cell |
|
| 230 |
-
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
|
| 231 |
-
| `slope_factor` | Faster uphill, slower downhill |
|
| 232 |
-
| `moisture` | Wet ground / recent rain reduces ignition probability |
|
| 233 |
-
| `suppression` | Crew presence and retardant coverage reduce spread |
|
| 234 |
-
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
|
| 235 |
-
|
| 236 |
-
Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
|
| 237 |
-
|
| 238 |
-
---
|
| 239 |
-
|
| 240 |
-
## Results
|
| 241 |
-
|
| 242 |
-
> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers
|
| 243 |
-
|
| 244 |
-
| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|
| 245 |
-
|---|---|---|---|
|
| 246 |
-
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
|
| 247 |
-
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | +4.74 ± 3.79 |
|
| 248 |
-
| **Trained Qwen-2.5-7B (ours)** |
|
| 249 |
-
| **Δ vs. Heuristic** |
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
|
| 256 |
-
|
|
| 257 |
-
|
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
**
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
**
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
|
| 285 |
-
|
| 286 |
-
|
| 287 |
-
β βββ
|
| 288 |
-
β βββ
|
| 289 |
-
β βββ
|
| 290 |
-
β βββ
|
| 291 |
-
β βββ
|
| 292 |
-
β βββ
|
| 293 |
-
β βββ
|
| 294 |
-
β βββ
|
| 295 |
-
β
|
| 296 |
-
βββ
|
| 297 |
-
β βββ
|
| 298 |
-
β βββ
|
| 299 |
-
βββ
|
| 300 |
-
β βββ
|
| 301 |
-
β
|
| 302 |
-
|
| 303 |
-
βββ
|
| 304 |
-
β βββ
|
| 305 |
-
β
|
| 306 |
-
|
| 307 |
-
β βββ
|
| 308 |
-
β βββ
|
| 309 |
-
β βββ
|
| 310 |
-
β
|
| 311 |
-
βββ
|
| 312 |
-
β βββ
|
| 313 |
-
β
|
| 314 |
-
|
| 315 |
-
β βββ
|
| 316 |
-
β
|
| 317 |
-
βββ
|
| 318 |
-
β
|
| 319 |
-
|
| 320 |
-
βββ
|
| 321 |
-
|
| 322 |
-
βββ
|
| 323 |
-
βββ
|
| 324 |
-
βββ
|
| 325 |
-
|
| 326 |
-
|
| 327 |
-
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
|
| 333 |
-
|
| 334 |
-
|
| 335 |
-
|
| 336 |
-
|
| 337 |
-
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
--
|
| 341 |
-
|
| 342 |
-
|
| 343 |
-
|
| 344 |
-
|
| 345 |
-
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
|
| 353 |
-
|
| 354 |
-
}
|
| 355 |
-
|
| 356 |
-
|
| 357 |
-
|
| 358 |
-
|
| 359 |
-
|
| 360 |
-
|
| 361 |
-
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Wildfire Containment Simulator
|
| 3 |
+
emoji: 🔥
|
| 4 |
+
colorFrom: red
|
| 5 |
+
colorTo: purple
|
| 6 |
+
sdk: docker
|
| 7 |
+
pinned: false
|
| 8 |
+
license: mit
|
| 9 |
+
tags:
|
| 10 |
+
- reinforcement-learning
|
| 11 |
+
- simulation
|
| 12 |
+
- openenv
|
| 13 |
+
- wildfire
|
| 14 |
+
- rl-environment
|
| 15 |
+
- long-horizon
|
| 16 |
+
- instruction-following
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
# Wildfire Containment Simulator
|
| 20 |
+
|
| 21 |
+
**Meta OpenEnv Hackathon, Theme 2: Long-Horizon Planning & Instruction Following**
|
| 22 |
+
|
| 23 |
+

|
| 24 |
+

|
| 25 |
+

|
| 26 |
+

|
| 27 |
+

|
| 28 |
+
|
| 29 |
+
A partially-observable disaster simulation where an LLM acts as **Incident Commander**, interpreting operational briefings, dispatching ground crews and air tankers, and recovering from cascading failures across 80–300-step episodes. Built on OpenEnv with Pydantic-typed actions, a Rothermel-inspired fire-spread model, and a decomposed reward designed for GRPO.
|
| 30 |
+
|
| 31 |
+
> **Headline result (post-training run, Apr 26):** Our trained Qwen-2.5-7B Incident Commander achieves a mean reward of **+5.74** on the Medium tier, vs. **+6.31** for the rule-based heuristic and **+1.31** for the random baseline. The model auto-promoted through all three curriculum tiers (easy → medium → hard) in just 63 of 150 training steps, maintaining a **99%+ JSON success rate** throughout.
|
| 32 |
+
> *(Full comparison table in [Results](#results). Model: [`Eshit/wildfire-grpo-7b`](https://huggingface.co/Eshit/wildfire-grpo-7b). W&B run: [wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu).)*
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## Quick Links
|
| 37 |
+
|
| 38 |
+
| Resource | Link |
|
| 39 |
+
|---|---|
|
| 40 |
+
| **Live HF Space (env)** | [huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) |
|
| 41 |
+
| 💻 **GitHub source** | [github.com/Abrodolph/Wildfire-Containment-Simulator](https://github.com/Abrodolph/Wildfire-Containment-Simulator) |
|
| 42 |
+
| **GRPO training notebook** | [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb) |
|
| 43 |
+
| **SFT warm-up notebook** | [`training/sft_colab.ipynb`](training/sft_colab.ipynb) |
|
| 44 |
+
| **Long-form blog post** | [`BLOG.md`](BLOG.md) |
|
| 45 |
+
| **Baseline eval JSON** | [`scripts/results.json`](scripts/results.json) |
|
| 46 |
+
| **Training dashboard** | [W&B run: wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) |
|
| 47 |
+
| 🎬 **Heuristic replay GIF** | [`demos/heuristic_replay.gif`](demos/heuristic_replay.gif) |
|
| 48 |
+
| 🎥 **2-minute pitch video** | *(YouTube link coming soon)* |
|
| 49 |
+
|
| 50 |
+
---
|
| 51 |
+
|
| 52 |
+
## Why Theme 2
|
| 53 |
+
|
| 54 |
+
| Pillar | How we model it |
|
| 55 |
+
|---|---|
|
| 56 |
+
| **Long-horizon planning** | The Hard tier runs 300 steps, and the +5.0 terminal reward is triggered only on full population survival, so greedy local moves cannot capture it. |
|
| 57 |
+
| **Instruction following** | Every episode opens with an `OperationalBriefing` (priority zones, infrastructure, weather forecast). A +1.0 adherence bonus rewards protecting the named priority zones. |
|
| 58 |
+
| **Recovery from failure** | Hard tier injects a second ignition at step 30 and forces a crew casualty at step 40. Reactive baselines that can't re-plan lose population. |
|
| 59 |
+
|
| 60 |
+
---
|
| 61 |
+
|
| 62 |
+
## Real-World Motivation
|
| 63 |
+
|
| 64 |
+
Wildfire response is a real public-safety resource-allocation problem. Incident commanders must decide where to deploy crews, when to call air support, how to protect communities, and how to adapt when conditions change mid-operation. We built a structured RL environment that captures the key tensions of this work (partial observability, changing weather, hard resource limits, and explicit tradeoffs between speed, efficiency, area saved, and civilian safety) so that an LLM can be trained, evaluated, and inspected on it end-to-end.
|
| 65 |
+
|
| 66 |
+
For the deeper story behind the design choices, see [`BLOG.md`](BLOG.md).
|
| 67 |
+
|
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
## Quickstart
|
| 71 |
+
|
| 72 |
+
```bash
|
| 73 |
+
# Clone and install
|
| 74 |
+
git clone https://github.com/Abrodolph/Wildfire-Containment-Simulator.git
|
| 75 |
+
cd Wildfire-Containment-Simulator
|
| 76 |
+
uv pip install -r requirements.txt
|
| 77 |
+
uv pip install -e .
|
| 78 |
+
|
| 79 |
+
# Run baseline evaluation (random + heuristic, all 3 tiers, 5 seeds)
|
| 80 |
+
python scripts/evaluate.py 5
|
| 81 |
+
|
| 82 |
+
# Compare agents head-to-head
|
| 83 |
+
python scripts/eval_compare.py --seeds 42 43 44 45 46 \
|
| 84 |
+
--tiers easy medium hard --agents random heuristic
|
| 85 |
+
|
| 86 |
+
# Render an episode as a GIF
|
| 87 |
+
python scripts/replay.py --tier medium --seed 42 \
|
| 88 |
+
--agent heuristic --output demos/replay.gif
|
| 89 |
+
|
| 90 |
+
# Spin up the OpenEnv FastAPI server locally on port 7860
|
| 91 |
+
python server/app.py
|
| 92 |
+
# Then visit http://localhost:7860/ui/ for the interactive frontend
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
Full test suite: `pytest tests -v` (41 tests, ~30s on CPU).
|
| 96 |
+
|
| 97 |
+
---
|
| 98 |
+
|
| 99 |
+
## Live Hugging Face Space
|
| 100 |
+
|
| 101 |
+
The environment is deployed at [`Eshit/Wildfire-Containment-Simulator`](https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator) on Hugging Face. Any external agent can drive it over plain HTTP, with no Python import needed:
|
| 102 |
+
|
| 103 |
+
```bash
|
| 104 |
+
SPACE=https://eshit-wildfire-containment-simulator.hf.space
|
| 105 |
+
|
| 106 |
+
curl "$SPACE/health"
|
| 107 |
+
curl -X POST "$SPACE/reset?task_id=easy&seed=42"
|
| 108 |
+
curl -X POST "$SPACE/step" -H "Content-Type: application/json" \
|
| 109 |
+
-d '{"action_type": "deploy_crew", "crew_id": "crew_0", "target_row": 7, "target_col": 7}'
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
Endpoints: `/reset`, `/step`, `/state`, `/state/render`, `/auto_step`, `/health`, `/docs` (Swagger UI), `/ui/` (interactive frontend).
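
The same calls can be made from Python with only the standard library. This is a minimal sketch: the helper names `build_reset_url`, `build_step_request`, and `post_json` are illustrative and not part of the repo, and nothing is assumed about the response schema beyond it being JSON.

```python
import json
import urllib.parse
import urllib.request

SPACE = "https://eshit-wildfire-containment-simulator.hf.space"


def build_reset_url(base: str, task_id: str, seed: int) -> str:
    """Return the /reset URL with task_id and seed as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "seed": seed})
    return f"{base}/reset?{query}"


def build_step_request(base: str, action: dict) -> urllib.request.Request:
    """Return a POST request for /step carrying the action as a JSON body."""
    return urllib.request.Request(
        f"{base}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

To drive the Space, wrap `build_reset_url(SPACE, "easy", 42)` in a `urllib.request.Request(..., method="POST")`, pass it to `post_json`, then loop on `post_json(build_step_request(SPACE, action))`.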
|
| 113 |
+
|
| 114 |
+
---
|
| 115 |
+
|
| 116 |
+
## Environment API
|
| 117 |
+
|
| 118 |
+
```python
|
| 119 |
+
from env import WildfireEnv, Action, ActionType, Direction
|
| 120 |
+
|
| 121 |
+
env = WildfireEnv()
|
| 122 |
+
obs = env.reset(task_id="easy", seed=42) # Observation (with OperationalBriefing on first step)
|
| 123 |
+
|
| 124 |
+
while not env.done:
|
| 125 |
+
    action = Action(
|
| 126 |
+
        action_type=ActionType.DEPLOY_CREW,
|
| 127 |
+
        crew_id="crew_0",
|
| 128 |
+
        target_row=7, target_col=7,
|
| 129 |
+
    )
|
| 130 |
+
    result = env.step(action)  # StepResult
|
| 131 |
+
    obs = result.observation
|
| 132 |
+
    reward = result.reward  # decomposed float, range ~-8 to +8
|
| 133 |
+
    done = result.done
|
| 134 |
+
|
| 135 |
+
state = env.state() # Full ground truth (grading only)
|
| 136 |
+
```
|
| 137 |
+
|
| 138 |
+
`reset(task_id, seed)` is fully deterministic. `state()` is intentionally exposed only for graders β agents must work from `Observation`.
|
| 139 |
+
|
| 140 |
+
---
|
| 141 |
+
|
| 142 |
+
## Action Space
|
| 143 |
+
|
| 144 |
+
All actions are Pydantic-validated. **Invalid actions return a penalty reward without crashing the environment.**
|
| 145 |
+
|
| 146 |
+
| Action | Required parameters | Description |
|
| 147 |
+
|---|---|---|
|
| 148 |
+
| `deploy_crew` | `crew_id`, `target_row`, `target_col` | Place an undeployed crew on a safe cell |
|
| 149 |
+
| `move_crew` | `crew_id`, `direction` (`N/S/E/W/NE/NW/SE/SW`) | Move a deployed crew one cell |
|
| 150 |
+
| `order_crew_objective` | `crew_id`, `objective` (`hold/advance/retreat/prioritize_*`) | Set a persistent directive for a crew's local policy |
|
| 151 |
+
| `drop_retardant` | `tanker_id`, `target_row`, `target_col` | 3×3 retardant drop with 5-step cooldown |
|
| 152 |
+
| `build_firebreak` | `crew_id`, `direction` | Permanent non-flammable cell adjacent to a crew |
|
| 153 |
+
| `recon_flight` | `target_row`, `target_col` | Reveal a 10×10 area for 5 steps |
|
| 154 |
+
| `idle` | `reason` *(optional)* | Explicitly wait |
|
| 155 |
+
|
| 156 |
+
A 3-layer parser (`env/action_parser.py`) maps raw LLM output to a structured `Action`: direct JSON → regex field extraction → safe-`idle` fallback. **The environment loop never breaks on bad model output.**
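
The three layers can be sketched as follows. This is an illustration, not the code in `env/action_parser.py`: it returns plain dicts rather than the Pydantic `Action` model, and the regex patterns, the field lists, and the `parse_action` name are assumptions.

```python
import json
import re

# Assumed fallback payload; the real parser may attach a different reason.
IDLE = {"action_type": "idle", "reason": "unparseable model output"}

# Assumed field lists, taken from the action table above.
_STRING_FIELDS = ("action_type", "crew_id", "tanker_id", "direction", "objective")
_INT_FIELDS = ("target_row", "target_col")


def parse_action(raw: str) -> dict:
    """Map raw model text to an action dict without ever raising."""
    # Layer 1: direct JSON, taken from the outermost {...} span in the text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "action_type" in parsed:
                return parsed
        except json.JSONDecodeError:
            pass  # fall through to regex extraction

    # Layer 2: pull individual fields out of loosely formatted text.
    fields = {}
    for name in _STRING_FIELDS:
        found = re.search(rf'"?{name}"?\s*[:=]\s*"?([A-Za-z_][A-Za-z0-9_]*)"?', raw)
        if found:
            fields[name] = found.group(1)
    for name in _INT_FIELDS:
        found = re.search(rf'"?{name}"?\s*[:=]\s*(-?\d+)', raw)
        if found:
            fields[name] = int(found.group(1))
    if "action_type" in fields:
        return fields

    # Layer 3: nothing usable was found, so wait safely.
    return dict(IDLE)
```

Because every path ends in a return, a rollout loop can call `parse_action` on arbitrary model output and always receive something the environment can validate.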
|
| 157 |
+
|
| 158 |
+
---
|
| 159 |
+
|
| 160 |
+
## Observation Space
|
| 161 |
+
|
| 162 |
+
| Component | Contents | Noise / occlusion |
|
| 163 |
+
|---|---|---|
|
| 164 |
+
| `briefing` | `OperationalBriefing` on the first observation: incident ID, priority zones, infrastructure, wind forecast | First step only |
|
| 165 |
+
| `grid` | 2D array of `CellObservation` (`fire_state`, `intensity_bin`, `smoke_density`, `is_populated`, `crew_present`) | Smoke occlusion (medium/hard); fog-of-war (hard) |
|
| 166 |
+
| `weather` | `wind_speed_kmh`, `wind_direction_deg`, `humidity_pct`, `rain_active` | ±5 km/h, ±20° on medium/hard |
|
| 167 |
+
| `resources` | Crew positions, tanker cooldowns, firebreak budget, recon budget | Fully observable |
|
| 168 |
+
| `stats` | `cells_burned`, `cells_burning`, `population_lost`, `containment_pct`, `current_step` | Fully observable |
|
| 169 |
+
| `recent_events` | Last 5 notable events | Fully observable |
|
| 170 |
+
|
| 171 |
+
The observation is rendered into LLM-friendly text via `serialize_observation()` (env/serialization.py), which BFS-clusters fire regions into bounding boxes so the prompt is `O(regions)` instead of `O(cells)`.
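
A rough sketch of that clustering step, for illustration only: it assumes 8-connectivity (matching the Moore neighborhood of the spread model) and a plain 2D grid of truthy/falsy burning flags, neither of which is confirmed for `serialize_observation()`. The name `cluster_fire_regions` is hypothetical.

```python
from collections import deque


def cluster_fire_regions(burning):
    """Group burning cells into 8-connected regions and return bounding boxes.

    `burning` is a 2D list of truthy/falsy values.
    Each box is (min_row, min_col, max_row, max_col), in row-major discovery order.
    """
    rows = len(burning)
    cols = len(burning[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not burning[r][c] or seen[r][c]:
                continue
            # Breadth-first search from this unvisited burning cell.
            seen[r][c] = True
            queue = deque([(r, c)])
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        inside = 0 <= nr < rows and 0 <= nc < cols
                        if inside and burning[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((min_r, min_c, max_r, max_c))
    return boxes
```

Each box becomes one line of prompt text, so a large contiguous fire front costs the same number of tokens as a small one.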
|
| 172 |
+
|
| 173 |
+
---
|
| 174 |
+
|
| 175 |
+
## Reward Function
|
| 176 |
+
|
| 177 |
+
Decomposed for GRPO: the wide reward range produces meaningful advantages between rollout groups.
|
| 178 |
+
|
| 179 |
+
**Per-step (dense):**
|
| 180 |
+
```
|
| 181 |
+
step_reward = 0.4 · Δcontainment + 0.4 · Δpopulation_safety − 0.1 · redundant_action_flag
|
| 182 |
+
```
|
| 183 |
+
|
| 184 |
+
**Terminal (sparse, on episode end):**
|
| 185 |
+
```
|
| 186 |
+
+5.0 if all populations safe
|
| 187 |
+
+0–2.0 efficiency bonus (faster containment → more)
|
| 188 |
+
+1.0 briefing-adherence bonus (all priority zones survived)
|
| 189 |
+
−3.0 · (pop_lost / total_pop) if any population lost
|
| 190 |
+
−2.0 if any crew casualty
|
| 191 |
+
−0.01 × invalid_action_count capped at −0.2
|
| 192 |
+
```
|
| 193 |
+
|
| 194 |
+
Total empirical range: **−8 to +8**, declared in `openenv.yaml`.
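
As a worked example, the terminal terms listed above can be combined like this. It is a sketch under stated assumptions, not the contents of `env/reward.py`: the function name and arguments are hypothetical, and the linear shape of the efficiency bonus is a guess, since only its 0 to 2.0 range is documented.

```python
def terminal_reward(pop_lost, total_pop, priority_zones_survived, crew_casualty,
                    invalid_action_count, steps_to_containment, max_steps):
    """Sum the sparse end-of-episode terms from the list above."""
    reward = 0.0
    if pop_lost == 0:
        reward += 5.0                              # all populations safe
    else:
        reward -= 3.0 * (pop_lost / total_pop)     # proportional population loss
    if steps_to_containment is not None:
        # ASSUMPTION: linear bonus in [0, 2], larger when containment is faster.
        reward += 2.0 * max(0.0, 1.0 - steps_to_containment / max_steps)
    if priority_zones_survived:
        reward += 1.0                              # briefing-adherence bonus
    if crew_casualty:
        reward -= 2.0
    reward -= min(0.01 * invalid_action_count, 0.2)  # penalty capped at 0.2
    return reward
```

For instance, an Easy episode contained at step 20 of 80 with no losses and all priority zones intact scores 5.0 + 1.5 + 1.0 = 7.5 under these assumptions, which is consistent with the approximate +8 ceiling once per-step rewards are added.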
|
| 195 |
+
|
| 196 |
+
| Tier | Spread scale | Episode length | Approx. reward ceiling |
|
| 197 |
+
|---|---|---|---|
|
| 198 |
+
| Easy | 1.00× | 80 | +8 |
|
| 199 |
+
| Medium | 0.70× | 150 | +7 |
|
| 200 |
+
| Hard | 0.55× | 300 | +6 |
|
| 201 |
+
|
| 202 |
+
---
|
| 203 |
+
|
| 204 |
+
## Three Difficulty Tiers
|
| 205 |
+
|
| 206 |
+
### Task 1 (Easy): Flatland Grass Fire
|
| 207 |
+
15×15 flat grid · single ignition · constant wind · no smoke or fog-of-war · 4 crews, 1 tanker, 15 firebreak cells · 80 steps. **Focus:** basic deployment and perimeter control.
|
| 208 |
+
|
| 209 |
+
### Task 2 (Medium): Canyon Terrain with Wind Shifts
|
| 210 |
+
25×25 mixed terrain · two ignition points · variable wind · smoke occlusion · sensor noise · 5 crews, 2 tankers, 20 firebreak cells, 1 recon · 150 steps. **Focus:** terrain-aware containment under multi-front pressure.
|
| 211 |
+
|
| 212 |
+
### Task 3 (Hard): Wildland-Urban Interface Crisis
|
| 213 |
+
40×40 terrain with roads, rivers, urban zones · staggered ignitions (step 30) · scripted crew casualty (step 40) · fog-of-war (radius 7) · aggressive wind shifts · 6 crews, 3 tankers, 30 firebreak cells, 3 recon · 300 steps. **Focus:** long-horizon planning under uncertainty and recovery from cascading failures.
|
| 214 |
+
|
| 215 |
+
---
|
| 216 |
+
|
| 217 |
+
## Fire Spread Model
|
| 218 |
+
|
| 219 |
+
A **Rothermel-inspired cellular automaton** on the 8-cell Moore neighborhood. Each tick, every burning cell attempts to ignite each unburned neighbor:
|
| 220 |
+
|
| 221 |
+
```
|
| 222 |
+
P(ignite) = base_rate × fuel_factor × wind_factor × slope_factor
|
| 223 |
+
            × (1 − moisture) × (1 − suppression) × tier_scale
|
| 224 |
+
```
|
| 225 |
+
|
| 226 |
+
| Factor | Effect |
|
| 227 |
+
|---|---|
|
| 228 |
+
| `base_rate` | Baseline spread by fuel type |
|
| 229 |
+
| `fuel_factor` | Fuel load of the target cell |
|
| 230 |
+
| `wind_factor` | Boost when wind aligns with the spread vector, dampened otherwise |
|
| 231 |
+
| `slope_factor` | Faster uphill, slower downhill |
|
| 232 |
+
| `moisture` | Wet ground / recent rain reduces ignition probability |
|
| 233 |
+
| `suppression` | Crew presence and retardant coverage reduce spread |
|
| 234 |
+
| `tier_scale` | `easy=1.00`, `medium=0.70`, `hard=0.55` |
|
| 235 |
+
|
| 236 |
+
Burning cells progress through `BURNING → EMBER → BURNED_OUT`. Urban cells have higher peak intensity but lower ignition probability.
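
The ignition formula translates directly into code. The sketch below is illustrative rather than a copy of `env/fire_spread.py`: the function names are hypothetical, clamping the product to [0, 1] is an assumption (the formula does not say how values above 1 are handled), and it uses the standard-library `random.Random` instead of the NumPy generator the project threads through its subsystems.

```python
import random

# Tier scales from the table above.
TIER_SCALE = {"easy": 1.00, "medium": 0.70, "hard": 0.55}


def ignition_probability(base_rate, fuel_factor, wind_factor, slope_factor,
                         moisture, suppression, tier):
    """Probability that one burning cell ignites one unburned neighbor this tick."""
    p = (base_rate * fuel_factor * wind_factor * slope_factor
         * (1.0 - moisture) * (1.0 - suppression) * TIER_SCALE[tier])
    return min(1.0, max(0.0, p))  # ASSUMPTION: clamp into a valid probability


def try_ignite(probability, rng):
    """Bernoulli trial using a seeded generator so runs stay reproducible."""
    return rng.random() < probability
```

With `base_rate=0.3`, `wind_factor=1.5`, `slope_factor=1.2`, `moisture=0.2`, and `suppression=0.5` on the Hard tier, the product is 0.3 × 1.5 × 1.2 × 0.8 × 0.5 × 0.55 = 0.1188, so a crew that halves spread also halves the per-neighbor ignition chance.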
|
| 237 |
+
|
| 238 |
+
---
|
| 239 |
+
|
| 240 |
+
## Results
|
| 241 |
+
|
| 242 |
+
> Baselines reproduced via `python scripts/evaluate.py 5` on seeds 42–46. Trained-model numbers from Section 10 of [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb), evaluated on seeds 42–56 (15 per tier, no overlap with training seeds 0–99).
|
| 243 |
+
|
| 244 |
+
| Agent | Easy (mean ± std) | Medium (mean ± std) | Hard (mean ± std) |
|
| 245 |
+
|---|---|---|---|
|
| 246 |
+
| Random | +6.23 ± 3.09 | +1.31 ± 3.24 | +2.16 ± 2.96 |
|
| 247 |
+
| Heuristic | **+7.53 ± 0.08** | **+6.31 ± 2.77** | **+4.74 ± 3.79** |
|
| 248 |
+
| **Trained Qwen-2.5-7B (ours)** | +5.13 ± 3.90 | **+5.74 ± 3.07** | +2.14 ± 2.87 |
|
| 249 |
+
| **Δ vs. Heuristic** | −2.41 | **−0.58 ✓** | −2.59 |
|
| 250 |
+
|
| 251 |
+
The Medium-tier result falls within ±1.0 of the heuristic's mean reward, which is the official passing criterion.
|
| 252 |
+
|
| 253 |
+
**Auxiliary metrics for the trained agent:**
|
| 254 |
+
|
| 255 |
+
| Metric | Easy | Medium | Hard |
|
| 256 |
+
|---|---|---|---|
|
| 257 |
+
| JSON success rate | 98.5% | 99.8% | 99.2% |
|
| 258 |
+
| Mean population saved % | 87% | 97% | 92% |
|
| 259 |
+
|
| 260 |
+
**Curriculum progression:** easy (steps 0–52) → medium (steps 53–62) → hard (steps 63–149). The model reached the hard tier after just 63 of 150 training steps.
|
| 261 |
+
|
| 262 |
+
> Full scores in [`training/grpo_eval_results.json`](training/grpo_eval_results.json). Training history in [`training/training_stats.json`](training/training_stats.json).
|
| 263 |
+
|
| 264 |
+
---
|
| 265 |
+
|
| 266 |
+
## Training
|
| 267 |
+
|
| 268 |
+
We use a two-stage recipe:
|
| 269 |
+
|
| 270 |
+
1. **SFT warm-up**: generate 4,300 `(prompt, action_json)` pairs from the heuristic on successful episodes (filtered to `pop_lost == 0`), then fine-tune Qwen-2.5-7B-Instruct with Unsloth 4-bit + LoRA (`r=32`, MLP+attention adapters). Notebook: [`training/sft_colab.ipynb`](training/sft_colab.ipynb).
|
| 271 |
+
2. **GRPO (TRL `GRPOTrainer`)**: start from the SFT adapter, score completions by *resetting the env to the exact `(tier, seed)` that produced each prompt*, applying the candidate action, and running the heuristic to terminal. Two reward functions are passed to TRL: `reward_fn_outcome` (full episode reward) and `reward_fn_format` (JSON validity). Curriculum auto-promotes easy → medium → hard. Notebook: [`training/grpo_v2_colab.ipynb`](training/grpo_v2_colab.ipynb).
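
To make the format reward concrete, here is a minimal sketch of a JSON-validity scorer in the shape TRL expects for custom reward functions (a callable that receives `completions` plus keyword arguments and returns one float per completion). It is not the notebook's `reward_fn_format`: the name `reward_fn_format_sketch`, the 1.0/0.0 values, and the assumption that completions arrive as plain strings are all illustrative.

```python
import json


def reward_fn_format_sketch(completions, **kwargs):
    """Return 1.0 for each completion that is a JSON object with an action_type."""
    rewards = []
    for completion in completions:
        try:
            parsed = json.loads(completion)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)  # not valid JSON, or not a string at all
            continue
        is_action = isinstance(parsed, dict) and "action_type" in parsed
        rewards.append(1.0 if is_action else 0.0)
    return rewards
```

Keeping format and outcome as separate reward functions lets the training logs show whether a drop in total reward comes from malformed output or from poor decisions.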
|
| 272 |
+
|
| 273 |
+
**Hardware:** A100 Large (40 GB) on a Hugging Face Space JupyterLab session. ~75 minutes total wall-clock time.
|
| 274 |
+
**Training stack:** `unsloth 2026.4.8` (4-bit QLoRA), `trl==0.20.0`, `datasets==3.4.1`, `transformers 5.5.0`, `peft`, `wandb`.
|
| 275 |
+
|
| 276 |
+
**Training plots:** W&B run [saini-eshit-/wildfire-grpo/runs/dnz56kuu](https://wandb.ai/saini-eshit-/wildfire-grpo/runs/dnz56kuu) (reward curve, KL divergence, format reward, curriculum tier timeline). Local dashboard: `training/training_dashboard.png` (not tracked in git; generate with `python scripts/plot_grpo_training.py`).
|
| 277 |
+
|
| 278 |
+
For the design rationale, the SFT/GRPO trade-offs, and a frank discussion of what went wrong on our first GRPO attempt, read [`BLOG.md`](BLOG.md).
|
| 279 |
+
|
| 280 |
+
---
|
| 281 |
+
|
| 282 |
+
## Project Structure
|
| 283 |
+
|
| 284 |
+
```text
|
| 285 |
+
Wildfire-Containment-Simulator/
|
| 286 |
+
├── env/
|
| 287 |
+
│   ├── wildfire_env.py         # Main env: reset(), step(), state()
|
| 288 |
+
│   ├── models.py               # Pydantic action/observation/state models
|
| 289 |
+
│   ├── grid.py                 # Terrain, smoke, moisture, fog-of-war
|
| 290 |
+
│   ├── fire_spread.py          # Cellular automaton fire propagation
|
| 291 |
+
│   ├── weather.py              # Stochastic weather engine
|
| 292 |
+
│   ├── resources.py            # Crews, tankers, firebreaks, recon
|
| 293 |
+
│   ├── reward.py               # Decomposed step + terminal reward
|
| 294 |
+
│   ├── briefing.py             # OperationalBriefing generation
|
| 295 |
+
│   ├── serialization.py        # Observation → LLM prompt
|
| 296 |
+
│   ├── action_parser.py        # LLM output → Action (3-layer fallback)
|
| 297 |
+
│   ├── rendering.py            # Frame rendering for GIF replays
|
| 298 |
+
│   └── curriculum.py           # CurriculumController (auto-promote/demote)
|
| 299 |
+
├── agents/
|
| 300 |
+
│   ├── random_agent.py
|
| 301 |
+
│   └── heuristic_agent.py
|
| 302 |
+
├── graders/
|
| 303 |
+
│   ├── grader_easy.py          # → (total_reward, details_dict)
|
| 304 |
+
│   ├── grader_medium.py
|
| 305 |
+
│   └── grader_hard.py
|
| 306 |
+
├── scripts/
|
| 307 |
+
│   ├── evaluate.py             # Baseline eval (random + heuristic)
|
| 308 |
+
│   ├── eval_compare.py         # Multi-agent comparison
|
| 309 |
+
│   ├── eval_trained_model.py   # Evaluate a trained adapter
|
| 310 |
+
│   ├── generate_sft_data.py    # Build SFT dataset from heuristic rollouts
|
| 311 |
+
│   ├── replay.py               # Render episode as GIF
|
| 312 |
+
│   ├── run_demo.py             # Pitch demo
|
| 313 |
+
│   └── plot_dashboard.py       # 4-panel training curves
|
| 314 |
+
├── training/
|
| 315 |
+
│   ├── grpo_v2_colab.ipynb     # GRPO notebook (canonical)
|
| 316 |
+
│   ├── sft_colab.ipynb         # SFT warm-up notebook
|
| 317 |
+
│   ├── sft_data.jsonl          # 4,300 SFT examples
|
| 318 |
+
│   ├── requirements.txt        # Training deps (Unsloth, TRL, etc.)
|
| 319 |
+
│   └── README.md
|
| 320 |
+
├── server/
|
| 321 |
+
│   └── app.py                  # FastAPI on port 7860
|
| 322 |
+
├── frontend/                   # Interactive HTML/JS frontend served at /ui/
|
| 323 |
+
├── tests/                      # 41 pytest tests
|
| 324 |
+
├── demos/                      # GIF/PNG demo assets
|
| 325 |
+
├── openenv.yaml                # OpenEnv environment manifest
|
| 326 |
+
├── Dockerfile                  # HF Space build
|
| 327 |
+
├── BLOG.md                     # Long-form write-up
|
| 328 |
+
└── README.md                   # You are here
|
| 329 |
+
```
|
| 330 |
+
|
| 331 |
+
---
|
| 332 |
+
|
| 333 |
+
## Architecture Decisions
|
| 334 |
+
|
| 335 |
+
1. **Decomposed reward for GRPO.** Dense per-step deltas (containment, population) plus sparse terminal spikes (+5 survival, −3 × loss, briefing adherence) give a wide reward range that produces meaningful advantages between GRPO rollout groups.
|
| 336 |
+
2. **Operational briefings as first-class instructions.** The briefing isn't cosmetic: protecting its named priority zones earns reward. This makes instruction-following measurable, not aspirational.
|
| 337 |
+
3. **Two-stage training (SFT → GRPO).** SFT teaches JSON-action formatting in ~1 epoch; GRPO then optimizes long-horizon strategy on the format-stable model. Going straight to GRPO from the base model produced near-zero reward in early experiments.
|
| 338 |
+
4. **3-layer action parser.** JSON parse → regex fallback → safe-`idle`. The training loop never breaks on malformed model output.
|
| 339 |
+
5. **Per-step (tier, seed) replay in the reward function.** Each GRPO completion is scored by replaying the *exact* env state that produced its prompt, not a random env. This was the single biggest fix between our v1 and v2 GRPO runs (see the "What broke" section of [`BLOG.md`](BLOG.md)).
|
| 340 |
+
6. **Deterministic seeding.** `np.random.default_rng(seed)` is threaded through every subsystem, so every run is byte-for-byte reproducible.
|
| 341 |
+
7. **OpenEnv compliance over framework lock-in.** The env is callable from Python (`env.reset/step/state`) and over HTTP (`/reset`, `/step`, `/state`). Any external agent (TRL, vLLM, an OpenAI-compatible API client, a curl loop) can drive it.
|
| 342 |
+
|
| 343 |
+
---
|
| 344 |
+
|
| 345 |
+
## Citation
|
| 346 |
+
|
| 347 |
+
If you use this environment, please cite:
|
| 348 |
+
|
| 349 |
+
```bibtex
|
| 350 |
+
@misc{wildfire-containment-simulator-2026,
|
| 351 |
+
title = {Wildfire Containment Simulator: Long-Horizon Planning and
|
| 352 |
+
Instruction Following for Disaster-Response LLM Agents},
|
| 353 |
+
author = {Team Wildfire},
|
| 354 |
+
year = {2026},
|
| 355 |
+
url = {https://huggingface.co/spaces/Eshit/Wildfire-Containment-Simulator},
|
| 356 |
+
note = {Meta OpenEnv Hackathon submission, Theme 2}
|
| 357 |
+
}
|
| 358 |
+
```
|
| 359 |
+
|
| 360 |
+
---
|
| 361 |
+
|
| 362 |
+
## License
|
| 363 |
+
|
| 364 |
+
[MIT](LICENSE). Built on [OpenEnv](https://github.com/openenv) for the Meta × Hugging Face × Scaler hackathon, April 2026.
|