---
title: Pyre — Crisis Navigation Environment
emoji: 🔥
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---

# Pyre — Crisis Navigation Environment for LLM Agents

> *When buildings burn, the difference between a safe evacuation and a tragedy is the quality of decisions made in the first 60 seconds. Can we train an LLM to make them?*

**Pyre** places an LLM agent *inside* a burning building. The agent must navigate to safety under partial observability — no global map, a real health system, hard time pressure, and a fire that actively spreads, blocks exits, and permanently alters the floor plan.

**Links:**
🔥 [Live Space](https://krooz-pyre-env.hf.space) &nbsp;|&nbsp;
🤖 [Trained Model](https://huggingface.co/Krooz/pyre-ppo-agent) &nbsp;|&nbsp;
📓 [Colab Training](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing) &nbsp;|&nbsp;
📝 [Blog](BLOG.md)

---

## Why Pyre vs. existing environments

| Feature | `grid_world` | `maze_env` | `wildfire_env` | **Pyre** |
|---|---|---|---|---|
| Observability | Full | Full | Partial | **Partial, first-person, text** |
| Map dynamics | Static | Static | Dynamic (fire) | **Dynamic (fire + doors + burnout)** |
| Action richness | 4 moves | 4 moves | Suppression | **Movement + door control + look** |
| Agent role | Mover | Mover | Suppressor | **Survivor** |
| Reward complexity | Reach goal | Reach goal | Suppress fire | **14-component composite rubric** |

*`wildfire_env` trains an agent to fight fires from above; Pyre trains an agent to survive from inside.*

---

## Quick start

```bash
uv sync
uv run server           # → http://localhost:8000

# Health check
curl http://localhost:8000/health

# Start episode
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "medium"}'

# Take a step
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": "move", "direction": "north"}'

# Random baseline (smoke test)
python examples/random_agent.py --episodes 5 --verbose
```

### Python client

```python
from pyre_env import PyreEnv, PyreAction

with PyreEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.narrative)
    result = env.step(PyreAction(action="move", direction="north"))
    print(f"Reward: {result.reward:.3f} | HP: {result.observation.agent_health}")
```

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8000` | HTTP server port |
| `PYRE_MAX_STEPS` | `150` | Default max steps per episode (overridden by difficulty preset) |
| `PYRE_SEED` | `42` | Base RNG seed; each episode increments by 37 |
| `HF_TOKEN` | — | Required only for `training/push_to_hub.py` |

---

## Architecture

```
reset() / step()
    │
    ▼
PyreEnvironment          server/pyre_env_environment.py
    ├── floor_plan.py    Building template or procedural generation
    ├── fire_sim.py      Cellular automaton: spread → intensity → smoke
    ├── narrative.py     BFS visibility → first-person text + structured fields
    └── rubrics.py       14 composable reward components
    │
    ▼
PyreObservation          models.py
    ├── narrative        str  — primary LLM input
    ├── map_state        PyreMapState — full grid snapshot for RL encoders
    ├── reward           float
    ├── done             bool
    └── metadata         dict — fire params, distances, difficulty
```

### Data flow per step

1. `_execute_action()` — move / door / look / wait, returns feedback string
2. Check evacuation — agent on EXIT cell with `fire < 0.5` → success
3. `FireSim.step()` — advance fire, smoke, burn timers; may convert cells to OBSTACLE
4. Apply health damage from smoke (0.5–5 HP/step) and fire (10 HP/step)
5. `_compute_reward()` — call all 14 rubrics with shared kwargs
6. `build_narrative_observation()` — BFS visibility, compose text, collect action hints
7. `_build_map_state()` — assemble full grid snapshot for UI / RL encoder
8. Return `PyreObservation`

---

## Project structure

```
pyre_env/
├── models.py                        PyreAction, PyreObservation, PyreMapState, PyreState
├── client.py                        PyreEnv (EnvClient subclass, narrative-focused)
├── openenv.yaml                     OpenEnv manifest (space, fastapi, port 8000)
├── pyproject.toml
│
├── server/
│   ├── app.py                       FastAPI bootstrap; stateful /reset, /step, /state, /scene
│   ├── pyre_env_environment.py      PyreEnvironment state machine + difficulty presets
│   ├── floor_plan.py                3 hand-authored templates + procedural generator
│   ├── fire_sim.py                  Cellular automaton fire/smoke simulation
│   ├── narrative.py                 BFS visibility + first-person text renderer
│   └── rubrics.py                   14 composable reward rubric classes
│
├── frontend/
│   ├── src/
│   │   ├── App.tsx                  Dashboard shell: topbar, canvas zone, side panel
│   │   ├── components/Map2D.tsx     Canvas2D renderer: fire, smoke, fog-of-war, agent
│   │   ├── components/HUD.tsx       HP bar, wind compass, step counter overlay
│   │   ├── components/ControlPanel.tsx  Move/door controls, difficulty, auto-wait
│   │   └── components/StatusCard.tsx   Agent biometrics, environment stats
│   └── README.md                    Frontend setup and demo script
│
├── training/
│   ├── ppo/
│   │   ├── train_torch_ppo.py       PPO (in-process or `--server` for HTTP EnvClient)
│   │   ├── train_torch_ppo_http.py  Thin wrapper: forwards argv to `train_torch_ppo.py --server ...`
│   │   └── pyre_ppo_training.ipynb  Colab notebook (self-contained, talks to HF Space)
│   └── push_to_hub.py               Upload checkpoint + metrics to HuggingFace Hub
│
├── examples/
│   └── random_agent.py              Baseline: 70% hint-biased, 30% random
│
└── artifacts/                       Training outputs: .pt, .csv, .png
```

---

## Simulation layer

### Fire simulation (`server/fire_sim.py`)

A stochastic cellular automaton over a flat row-major grid. Each call to `FireSim.step()` runs three phases:

**Phase 1 — Ignition.** Any cell with `fire ≥ FIRE_BURNING (0.3)` tries to ignite each cardinal neighbor:

```
p_ignite = p_spread × (1 − humidity) × wind_multiplier × fuel_map[neighbor]
```

- **Wind multiplier**: dot product of spread direction with wind vector → downwind 2×, upwind 0.5×, crosswind 1×
- **Closed doors**: `DOOR_CLOSED_FIRE_FACTOR = 0.15` (fire crosses at 15% normal rate)
- **Fuel map**: per-cell float from `floor_plan.py`; office rooms 1.5×, exits 0.6×
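
The ignition formula and its modifiers can be sketched as a small function. This is a minimal illustration, not the code in `fire_sim.py`: the function names, the `(dx, dy)` direction encoding, the convention that the wind vector points where the wind blows toward, and the final clamp to `[0, 1]` are all assumptions.

```python
# Constant quoted from this README.
DOOR_CLOSED_FIRE_FACTOR = 0.15


def wind_multiplier(spread_dir, wind_dir):
    """Return 2.0 downwind, 0.5 upwind, 1.0 crosswind or calm.

    Both arguments are (dx, dy) unit steps; wind_dir is (0, 0) when calm.
    (Hypothetical encoding, chosen for this sketch.)
    """
    dot = spread_dir[0] * wind_dir[0] + spread_dir[1] * wind_dir[1]
    if dot > 0:
        return 2.0
    if dot < 0:
        return 0.5
    return 1.0


def ignition_probability(p_spread, humidity, spread_dir, wind_dir, fuel,
                         through_closed_door=False):
    """Probability that a burning cell ignites one cardinal neighbor."""
    p = p_spread * (1.0 - humidity) * wind_multiplier(spread_dir, wind_dir) * fuel
    if through_closed_door:
        p *= DOOR_CLOSED_FIRE_FACTOR
    # Clamp so the result is always a valid probability (assumption).
    return min(1.0, max(0.0, p))


# Spreading east with an east wind into an office cell (fuel 1.5).
print(ignition_probability(0.3, 0.2, (1, 0), (1, 0), 1.5))
# Same spread, but the neighbor sits behind a closed door.
print(ignition_probability(0.3, 0.2, (1, 0), (1, 0), 1.5, through_closed_door=True))
```

With a spread rate of 0.3 and humidity 0.2, the downwind office case works out to 0.3 × 0.8 × 2 × 1.5 = 0.72, and a closed door cuts that to roughly 0.108.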

**Phase 2 — Intensity.** Existing fire gains `FIRE_INTENSITY_GAIN (0.15) × fuel_map[i]` per step. When `burn_timer ≥ BURNOUT_TICKS (5)` and intensity reaches 1.0, the cell becomes `OBSTACLE` — permanently impassable rubble.
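
A per-cell sketch of the intensity phase follows. It assumes `burn_timer` only counts steps spent at full intensity (as the constants table describes `BURNOUT_TICKS`); the function names are hypothetical and the real `FireSim` may track the timer differently.

```python
# Constants quoted from this README.
FIRE_INTENSITY_GAIN = 0.15
BURNOUT_TICKS = 5


def advance_burning_cell(intensity, burn_timer, fuel):
    """Advance one burning cell by one step.

    Returns (intensity, burn_timer, burned_out). The timer is assumed to
    tick only while the cell is at full intensity.
    """
    intensity = min(1.0, intensity + FIRE_INTENSITY_GAIN * fuel)
    if intensity >= 1.0:
        burn_timer += 1
    burned_out = intensity >= 1.0 and burn_timer >= BURNOUT_TICKS
    return intensity, burn_timer, burned_out


def steps_until_burnout(start_intensity, fuel, max_steps=1000):
    """Count steps until the cell would become OBSTACLE, or None if never."""
    intensity, timer = start_intensity, 0
    for step in range(1, max_steps + 1):
        intensity, timer, burned_out = advance_burning_cell(intensity, timer, fuel)
        if burned_out:
            return step
    return None


# A cell already at full intensity burns out after BURNOUT_TICKS steps.
print(steps_until_burnout(1.0, fuel=1.0))
```

Higher-fuel zones reach full intensity sooner, so under this reading office rooms turn to rubble earlier than corridors or exits.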

**Phase 3 — Smoke.** Smoke is sourced at +0.3/step for cells with `fire ≥ 0.3`, diffuses between neighbors at `SMOKE_SPREAD_RATE (0.20)`, passes through closed doors at 40% rate, and decays per cell according to `ventilation_map`.

**Key constants:**

| Constant | Value | Role |
|---|---|---|
| `FIRE_IGNITION` | 0.1 | Starting intensity for new ignitions |
| `FIRE_BURNING` | 0.3 | Threshold for spreading and causing damage |
| `FIRE_INTENSITY_GAIN` | 0.15 | Intensity added per step to burning cell |
| `BURNOUT_TICKS` | 5 | Steps at full intensity before cell → OBSTACLE |
| `DOOR_CLOSED_FIRE_FACTOR` | 0.15 | Fire spread multiplier through closed doors |
| `SMOKE_SPREAD_RATE` | 0.20 | Smoke diffusion rate between neighbors |
| `SMOKE_DOOR_FACTOR` | 0.40 | Smoke rate through closed doors |
| `EXIT_BLOCKED_FIRE_THRESHOLD` | 0.5 | Fire intensity at which an exit is considered blocked |

### Building templates (`server/floor_plan.py`)

Three hand-authored 16×16 templates for easy and medium difficulty:

| Template | Layout | Exits | Doors | Notes |
|---|---|---|---|---|
| `small_office` | Two corridor bands + office rooms N/S | 2 (W, E walls) | 8 (room↔corridor) | Agent spawns in corridor |
| `open_plan` | Open hall with 4 × 2×2 pillar obstacles | 2 (diagonal corners) | 0 | High ventilation throughout |
| `t_corridor` | T-shaped: vertical stem + horizontal bar | 3 (top, left, right) | 4 (rooms off stem) | Multiple route decisions |

Each template carries a `zone_map` (cell → zone label), derived `fuel_map`, and `ventilation_map`:

| Zone | Fuel multiplier | Smoke decay/step | Notes |
|---|---|---|---|
| `north/south_offices` | 1.5× | 0.010 | High fuel, poor ventilation |
| `west/east_rooms` | 1.5× | 0.010 | Same as offices |
| `main_corridor` | 1.0× | 0.028 | Baseline |
| `northwest/northeast/etc. hall` | 0.9× | 0.050 | Open plan — best ventilation |
| `exit` | 0.6× | 0.040 | Concrete, vented |

**Hard mode — procedural generation.** Episodes run on a freshly generated 20×24 floor plan every time:

1. **Room placement**: random non-overlapping rectangles (3–5 × 3–4 cells, 6–10 rooms)
2. **MST corridors**: Prim-style minimum spanning tree connecting room centers via L-shaped tunnels
3. **Exit placement**: deterministic tunnels from leftmost/rightmost floor cells to outer walls
4. **Connectivity guard**: BFS from agent spawn verifies ≥1 exit is reachable; up to 3 attempts; falls back to `small_office`
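
The connectivity guard in step 4 can be sketched as a plain BFS plus a retry loop. The grid encoding (`#` wall, `.` floor, `E` exit) and the names `exit_reachable` and `generate_with_guard` are illustrative assumptions, not the identifiers used in `floor_plan.py`.

```python
from collections import deque


def exit_reachable(grid, start):
    """Return True if any 'E' cell is reachable from start via non-wall cells.

    grid is a list of equal-length strings; start is a (row, col) tuple.
    """
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "E":
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and grid[nr][nc] != "#"):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def generate_with_guard(generate, spawn_of, fallback, max_attempts=3):
    """Try up to max_attempts generated maps; return fallback if none connect."""
    for _ in range(max_attempts):
        grid = generate()
        if exit_reachable(grid, spawn_of(grid)):
            return grid
    return fallback


connected = ["#####", "#..E#", "#####"]
sealed = ["#####", "#.#E#", "#####"]
print(exit_reachable(connected, (1, 1)), exit_reachable(sealed, (1, 1)))
```

In the real generator the fallback would be the `small_office` template; here it is simply whatever grid the caller passes in.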

### Visibility (`server/narrative.py`)

Visibility is a BFS flood-fill from the agent's position; walls block expansion. The radius shrinks as smoke at the agent's cell thickens:

| Agent smoke level | Visibility radius |
|---|---|
| None / light (`< 0.5`) | 5 cells |
| Moderate (`0.5–0.8`) | 3 cells |
| Heavy (`≥ 0.8`) | 2 cells |
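
A radius-limited flood-fill matching the table might look like the sketch below. It assumes the radius is measured in BFS steps, that wall cells themselves are excluded from the visible set, and a toy grid encoding (`#` for walls); `narrative.py` may differ on all three.

```python
from collections import deque


def visibility_radius(smoke):
    """Map the smoke density at the agent's cell to a radius in cells."""
    if smoke >= 0.8:
        return 2
    if smoke >= 0.5:
        return 3
    return 5


def visible_cells(grid, agent, smoke_at_agent):
    """Return the set of (row, col) cells within the smoke-limited BFS radius."""
    radius = visibility_radius(smoke_at_agent)
    rows, cols = len(grid), len(grid[0])
    dist = {agent: 0}
    queue = deque([agent])
    while queue:
        r, c = queue.popleft()
        if dist[(r, c)] == radius:
            continue  # do not expand past the radius
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in dist and grid[nr][nc] != "#"):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return set(dist)


corridor = ["..........."]  # 11 open cells, agent in the middle
for smoke in (0.0, 0.6, 0.9):
    print(smoke, len(visible_cells(corridor, (0, 5), smoke)))
```

In the straight corridor above, the visible set shrinks from 11 cells in clear air to 7 in moderate smoke and 5 in heavy smoke.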

---

## What the agent sees

Every step, `narrative.py` assembles a first-person text observation from raw grid state:

```
You are in the **main_corridor**. The air is **moderate**.
Health: ████████░░ (85/100) | Wind: **EAST**
Flames are visible to the **west**.
Exits visible: exit_0_7 at 8m west.
Doors: door_1 (closed) at 2m east.
You hear: Fire alarm sounding; Smoke detector beeping.
Last action: You move south. The smoke is thick here.
Available actions: move(direction='north')  move(direction='south')
                   door(target_id='door_1', door_state='open')  look(direction='east')  wait()
```

The same state is also exposed as structured fields in `PyreObservation` (smoke level, fire direction, visible objects, blocked exits, action hints) and as a full grid snapshot in `PyreObservation.map_state` for programmatic / RL use.

---

## Action space

| Action | Parameters | Effect |
|---|---|---|
| `move` | `direction: north\|south\|east\|west` | Move one cell; blocked by walls, obstacles, closed doors |
| `door` | `target_id: str`, `door_state: open\|close` | Open or close a door within 2 cells Manhattan distance |
| `look` | `direction: north\|south\|east\|west` | Ray-scan up to 5 cells; returns per-cell smoke/fire/zone/door/exit detail. Time still advances. |
| `wait` | — | Skip turn |
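
As an illustration of the table, the helper below builds `/step` request bodies with basic validation. The field names come from this README (`action`, `direction`, `target_id`, `door_state`); the helper itself, `make_action`, is hypothetical, and the server may accept or require a different set of optional fields.

```python
DIRECTIONS = {"north", "south", "east", "west"}
DOOR_STATES = {"open", "close"}


def make_action(action, direction=None, target_id=None, door_state=None):
    """Build a JSON-serialisable body for POST /step, or raise ValueError."""
    if action in ("move", "look"):
        if direction not in DIRECTIONS:
            raise ValueError(f"{action} needs a direction in {sorted(DIRECTIONS)}")
        return {"action": action, "direction": direction}
    if action == "door":
        if not target_id or door_state not in DOOR_STATES:
            raise ValueError("door needs target_id and door_state of open or close")
        return {"action": "door", "target_id": target_id, "door_state": door_state}
    if action == "wait":
        return {"action": "wait"}
    raise ValueError(f"unknown action: {action!r}")


print(make_action("move", direction="north"))
print(make_action("door", target_id="door_1", door_state="close"))
print(make_action("wait"))
```

Validating on the client side keeps malformed actions from costing a time step, since every accepted action (including `look`) advances the simulation.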

---

## Reward function — all 14 components

### Per-step rubrics

| Class | Value | Condition |
|---|---|---|
| `TimeStepPenalty` | −0.01 | Every step |
| `ProgressReward` | +0.25 | `move` reduced BFS distance to nearest unblocked exit |
| `ProgressRegressionPenalty` | −0.15 | `move` increased BFS distance to nearest exit |
| `SafeProgressBonus` | +0.05 | Progress AND new cell has `smoke < 0.5` |
| `DangerPenalty` | −0.50 | `move` into cell with `smoke ≥ 0.5` OR adjacent to `fire ≥ 0.3` |
| `HealthDrainPenalty` | −0.02 × dmg | Proportional to HP lost this step |
| `StrategicDoorBonus` | +0.50 | Closed a door with a cardinal neighbor `fire ≥ 0.3`; once per door per episode |
| `ExplorationBonus` | +0.02 | `move` to a cell not visited this episode |

### Episode-end rubrics (fire only when `done=True`)

| Class | Value | Condition |
|---|---|---|
| `SelfSurviveBonus` | +5.0 | Agent evacuated alive |
| `HealthSurvivalBonus` | +1.5 × (hp/100) | Agent evacuated (range 0 → +1.5) |
| `SelfDeathPenalty` | −10.0 | Agent died (HP ≤ 0) |
| `TimeoutPenalty` | −5.0 to −8.0 | Alive but out of steps; scaled by `−5 − 3×(hp/100)` when exits were reachable |
| `NearMissBonus` | max(0, 3.0 − 0.5 × min_BFS_dist) | On death only; `min_BFS_dist` = closest BFS distance to any exit reached this episode |
| `TimeBonus` | +0.05 × remaining_steps | Agent evacuated |

**BFS note:** `ProgressReward`, `ProgressRegressionPenalty`, `SafeProgressBonus`, and `NearMissBonus` all use true BFS traversal distance. Walls and obstacles block; closed doors are treated as passable, so the reward models optimal reachability assuming doors can be opened. The PPO trainer’s exit “pull” (below) instead uses **Manhattan** distance to the listed exit cells as an extra shaping signal; it is not part of the environment rubric.
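
The episode-end rubrics can be combined into one worked function. This is a sketch of the arithmetic in the table above, not the classes in `rubrics.py`; the function name is hypothetical, and the flat −5.0 timeout when no exit was reachable is an assumption inferred from the stated range.

```python
def episode_end_reward(evacuated, alive, hp, remaining_steps, min_bfs_dist,
                       exits_reachable=True):
    """Sum of the terminal rubric terms for one finished episode."""
    if evacuated:
        # SelfSurviveBonus + HealthSurvivalBonus + TimeBonus
        return 5.0 + 1.5 * (hp / 100.0) + 0.05 * remaining_steps
    if not alive:
        # SelfDeathPenalty + NearMissBonus
        return -10.0 + max(0.0, 3.0 - 0.5 * min_bfs_dist)
    # TimeoutPenalty: alive but out of steps
    if exits_reachable:
        return -5.0 - 3.0 * (hp / 100.0)
    return -5.0  # assumed value when every exit was cut off


# Evacuated with 80 HP and 40 steps to spare: 5.0 + 1.2 + 2.0
print(round(episode_end_reward(True, True, 80, 40, 0), 3))
# Died two BFS steps from an exit: -10.0 + 2.0
print(round(episode_end_reward(False, False, 0, 0, 2), 3))
# Timed out at 50 HP with exits reachable: -5.0 - 1.5
print(round(episode_end_reward(False, True, 50, 0, 6), 3))
```

Note that a healthy agent that times out is penalised more than a weakened one, which discourages idling in a safe spot.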

### PPO training script only (`training/ppo/train_torch_ppo.py`)

The server’s step reward above is **further adjusted** inside the training loop (not returned to HTTP clients):

- **−0.05** on `wait`
- **−0.15** after a `move` if any cardinal neighbor has fire **> 0.15**
- **−0.20** if the new position was already in the last **12** positions this episode
- **+ max(0, 0.25 − 0.04 × d)** on `move` when not yet evacuated, where **d** is the **Manhattan** distance to the nearest cell in `map_state.exit_positions`
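
The trainer-side adjustments can be sketched as a single function added to the server's step reward. The signature and names are hypothetical; the sketch also assumes the revisit penalty applies only after a `move`, and that `recent_positions` already holds at most the last 12 positions.

```python
def shaping_adjustment(action, new_pos, recent_positions, neighbor_fire,
                       exit_positions, evacuated):
    """Extra reward the PPO loop adds on top of the environment reward.

    neighbor_fire: fire intensities of the four cardinal neighbors.
    exit_positions: list of (x, y) exit cells, as in map_state.exit_positions.
    """
    bonus = 0.0
    if action == "wait":
        bonus -= 0.05
    if action == "move":
        if any(f > 0.15 for f in neighbor_fire):
            bonus -= 0.15  # stepped next to fire
        if new_pos in recent_positions:
            bonus -= 0.20  # revisiting a recent cell (assumed move-only)
        if not evacuated and exit_positions:
            d = min(abs(new_pos[0] - ex) + abs(new_pos[1] - ey)
                    for ex, ey in exit_positions)
            bonus += max(0.0, 0.25 - 0.04 * d)  # Manhattan exit pull
    return bonus


# Three cells from the nearest exit, no fire nearby, fresh cell: 0.25 - 0.12
print(round(shaping_adjustment("move", (3, 3), [], [0.0] * 4, [(0, 3)], False), 3))
```

The exit pull reaches zero once the agent is seven or more cells away (0.25 − 0.04 × 7 < 0), so it only shapes behaviour near exits.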

---

## Difficulty presets

| Level | Sources | Spread rate | Humidity | Wind | Max steps | Map |
|---|---|---|---|---|---|---|
| `easy` | 1 | 10–20% | 30–50% | CALM only | 200 | Fixed 16×16 templates |
| `medium` | 2–4 | 15–40% | 10–45% | Any | 150 | Fixed 16×16 templates |
| `hard` | 3–5 | 30–55% | 5–20% | Never CALM | 100 | Procedural 20×24 |

**Health damage rates** (applied after fire sim step):

| Condition | HP/step |
|---|---|
| Light smoke (`0.2–0.5`) | 0.5 |
| Moderate smoke (`0.5–0.8`) | 2.0 |
| Heavy smoke (`≥ 0.8`) | 5.0 |
| On fire (`fire ≥ 0.3`) | 10.0 |

Smoke and fire damage stack if both conditions apply.
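
The damage table translates directly into a lookup. The function name is hypothetical and the handling of exact boundary values (for example whether 0.5 counts as moderate) is an assumption based on the `≥` signs in the table.

```python
def health_damage(smoke, fire):
    """HP lost this step for the given smoke density and fire intensity."""
    dmg = 0.0
    if smoke >= 0.8:
        dmg += 5.0   # heavy smoke
    elif smoke >= 0.5:
        dmg += 2.0   # moderate smoke
    elif smoke >= 0.2:
        dmg += 0.5   # light smoke
    if fire >= 0.3:
        dmg += 10.0  # standing in a burning cell; stacks with smoke
    return dmg


print(health_damage(0.6, 0.4))  # moderate smoke plus fire
print(health_damage(0.1, 0.0))  # clean air
```

At 12 HP per step in a burning, smoky cell, an agent at full health survives fewer than nine steps, which is why `DangerPenalty` fires before the damage does.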

---

## HTTP API

The FastAPI server exposes both the standard OpenEnv routes and additional endpoints:

| Method | Path | Body | Returns |
|---|---|---|---|
| `GET` | `/health` | — | `{"status": "ok"}` |
| `POST` | `/reset` | `{"difficulty": "medium", "seed": null}` | `{observation, reward, done, metadata}` |
| `POST` | `/step` | `{"action": "move", "direction": "north"}` | `{observation, reward, done, metadata}` |
| `GET` | `/state` | — | Full `PyreState` dump |
| `GET` | `/scene` | — | Structured scene graph for UI renderers |
| `GET` | `/` | — | Frontend `index.html` |

`/scene` returns a 5-channel per-cell tensor (`cell_type`, `fire`, `smoke`, `is_agent`, `is_visible`) plus structured `labels` (agent position/health/location, episode params, door registry) — consumed by the React frontend.

---

## Training

Three training surfaces share the same PPO algorithm core from `train_torch_ppo.py`:

### 1. In-process (fastest)

```bash
python training/ppo/train_torch_ppo.py \
  --episodes 500 \
  --device cuda \
  --difficulty-schedule easy,medium,hard \
  --patience-threshold 0.65 \
  --output artifacts/pyre_ppo.pt
```

### 2. HTTP (against live server)

`train_torch_ppo_http.py` is a thin wrapper; it runs `train_torch_ppo.py` with `--server http://localhost:8000` and passes through any other flags.

```bash
# Start server first
uv run server

# Equivalent: python training/ppo/train_torch_ppo.py --server http://localhost:8000 --episodes 300
python training/ppo/train_torch_ppo_http.py --episodes 300
```

### 3. Colab notebook (against HF Space)

Open [`training/ppo/pyre_ppo_training.ipynb`](training/ppo/pyre_ppo_training.ipynb) or the [hosted Colab](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing). The notebook points `SERVER_URL` at `https://krooz-pyre-env.hf.space` and trains entirely over HTTP.

### Observation encoding

The `ObservationEncoder` in `train_torch_ppo.py` encodes each `PyreObservation` into a **5,790-dim float32 vector** (`ObservationEncoder.base_dim`):

```
Grid:    24×24 × 10 channels = 5,760
         • 6 one-hot cell type (floor/wall/door_open/door_closed/exit/obstacle)
         • fire intensity [0,1]
         • smoke density  [0,1]
         • visibility mask (1=visible)
         • agent position mask

Scalars: 17 global features
         health, step_progress, fire_spread_rate, humidity,
         agent_x_norm, agent_y_norm, nearest_exit_distance,
         reachable_exit_count, visible_cell_count, fire_sources,
         smoke_severity, alive, evacuated,
         exit_dx_norm, exit_dy_norm, exit_manhattan_norm   ← exit compass (map-agnostic)

One-hots: wind (5) + difficulty (4: easy, medium, hard_fixed, hard) + route hint (4: N/S/W/E) = 13

Total: 5,760 + 17 + 13 = 5,790
```

With `--history-length 4` (default), four frames are stacked: **input_dim = 23,160**.
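
The dimension arithmetic above can be checked in a few lines. The constant names are illustrative; only the numbers come from this README.

```python
GRID_SIDE = 24
CELL_TYPE_ONE_HOT = 6          # floor/wall/door_open/door_closed/exit/obstacle
EXTRA_CHANNELS = 4             # fire, smoke, visibility mask, agent mask
SCALARS = 17                   # global features
ONE_HOTS = 5 + 4 + 4           # wind + difficulty + route hint

grid_dim = GRID_SIDE * GRID_SIDE * (CELL_TYPE_ONE_HOT + EXTRA_CHANNELS)
base_dim = grid_dim + SCALARS + ONE_HOTS


def stacked_dim(history_length=4):
    """Input size after stacking history_length frames."""
    return base_dim * history_length


print(grid_dim, base_dim, stacked_dim())
```

Changing `--history-length` scales the input linearly, so the first fully connected layer dominates the parameter count.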

### Network architecture

```
Input (23,160)
  → LayerNorm → FC(512) → LayerNorm → ReLU
  → FC(256)   → LayerNorm → ReLU
  → FC(128)   → ReLU
       ├── Policy head → FC(37) logits + action mask (−∞ for invalid)
       └── Value head  → FC(1) scalar
```

The policy uses **37** discrete actions (4 move, 1 wait, 16 door-open, 16 door-close); `look` is not in the PPO head because the map encoder already carries visibility. Orthogonal init (√2 gain hidden layers, 0.01 policy head). Total parameters: **~12.1M** (varies slightly with `hidden-sizes`).
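
The parameter estimate can be reproduced from the diagram with plain arithmetic. This sketch assumes every fully connected layer has a bias, every LayerNorm has a learnable scale and shift, and LayerNorm appears exactly where the diagram shows it; the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus bias of one fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(n):
    """Learnable scale and shift of one LayerNorm."""
    return 2 * n


def count_parameters(input_dim=23_160, hidden=(512, 256, 128), n_actions=37):
    total = layernorm_params(input_dim)      # LayerNorm on the raw input
    prev = input_dim
    for i, width in enumerate(hidden):
        total += linear_params(prev, width)
        if i < len(hidden) - 1:              # no LayerNorm after the last hidden layer
            total += layernorm_params(width)
        prev = width
    total += linear_params(prev, n_actions)  # policy head
    total += linear_params(prev, 1)          # value head
    return total


print(f"{count_parameters():,}")
```

Under these assumptions the total comes to 12,075,414, in line with the ~12.1M figure, and about 98% of it sits in the first layer (23,160 × 512 weights).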

### Default PPO / curriculum flags

- `--difficulty-schedule` default: `easy,medium,hard_fixed,hard` (full four-stage path including two hard modes).
- Patience (defaults): threshold **0.65**, window **15** episodes, optional `--hard-mix-ratio` / `--hard-mix-dist` during the hard stage.

### Curriculum

`PatienceCurriculum` gates stage advancement as follows:

```
Stay on current stage until:
  success_rate (last 30 eps) ≥ patience_threshold (default 0.65)
  for patience_window (default 15) consecutive episodes
→ then advance to the next stage in --difficulty-schedule

Final stage: optional replay of the previous stage (--hard-mix-ratio, default 0.25)
             or a custom distribution (--hard-mix-dist), to limit forgetting
```

With the default schedule `easy,medium,hard_fixed,hard`, the “replay” stage during **hard** is typically **`hard_fixed`**, not `medium`. The **`pyre_ppo_hard_v2`** run used an explicit `hard:…,medium:…,easy:…` mix instead; see `training/push_to_hub.py` for the exact flags.
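
The gating rule can be sketched as a small class. `PatienceGate` is a hypothetical stand-in for `PatienceCurriculum`, not its actual code; it assumes the success rate is computed over however many of the last 30 episodes exist so far, and that the history resets when a stage advances.

```python
from collections import deque


class PatienceGate:
    """Advance through difficulty stages once success stays above a threshold."""

    def __init__(self, schedule, threshold=0.65, window=15, rate_span=30):
        self.schedule = list(schedule)
        self.threshold = threshold
        self.window = window
        self.outcomes = deque(maxlen=rate_span)  # rolling success history
        self.stage_index = 0
        self.streak = 0                          # consecutive episodes above threshold

    @property
    def stage(self):
        return self.schedule[self.stage_index]

    def record(self, success):
        """Record one episode outcome and return the stage for the next episode."""
        self.outcomes.append(1.0 if success else 0.0)
        rate = sum(self.outcomes) / len(self.outcomes)
        self.streak = self.streak + 1 if rate >= self.threshold else 0
        is_last = self.stage_index == len(self.schedule) - 1
        if self.streak >= self.window and not is_last:
            self.stage_index += 1
            self.streak = 0
            self.outcomes.clear()  # assumed: start the next stage with fresh history
        return self.stage


gate = PatienceGate(["easy", "medium", "hard_fixed", "hard"])
for _ in range(15):
    gate.record(True)
print(gate.stage)
```

Fifteen straight successes are the fastest possible promotion; a single dip of the rolling rate below the threshold resets the streak.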

### Push to HuggingFace Hub

```bash
export HF_TOKEN=hf_...
uv run python training/push_to_hub.py \
  --repo-id Krooz/pyre-ppo-agent \
  --stem pyre_ppo_hard_v2 \
  --artifacts-dir artifacts
```

Uploads `{stem}.pt`, `{stem}.csv`, `{stem}.png`, `{stem}_eval.csv`, and generates a model card README. The script’s embedded summary targets the **`pyre_ppo_hard_v2`** HTTP run; adjust `--stem` if you use a different checkpoint prefix. Trained weights: **[Krooz/pyre-ppo-agent](https://huggingface.co/Krooz/pyre-ppo-agent)**.

### Training results

![Pyre PPO — HTTP run `pyre_ppo_hard_v2`, 600 episodes, easy → medium → hard](artifacts/pyre_ppo_hard_v2.png)

**Primary run on record (`artifacts/pyre_ppo_hard_v2.*`): 600 HTTP episodes** with a patience-gated **easy → medium → hard** schedule, eval every 25 episodes on **hard** (see `training/push_to_hub.py` for the exact CLI, metrics table, and hub model-card text). Representative headline numbers from that run: **~55%** final training success rate (MA-20, graph title), **~52.7%** overall evacuation over all 600 episodes, and **~10.5%** evacuation on **hard** episodes within the run—showing the agent still struggles on fully procedural hard maps while improving on easy/medium.

**Earlier ablation (200 episodes, easy → medium only):** a previous curve reached **~75%** success on **medium** after 200 episodes (no `hard` in the training mix). That artifact set is no longer in the tree; the figure and CSV above supersede it for the hackathon write-up.

---

## Frontend

A cinematic real-time visualization built in React 19 + Vite + TypeScript.

```bash
cd frontend
npm install
npm run dev     # → http://localhost:5173
```

The map renderer (`src/components/Map2D.tsx`) uses HTML5 Canvas 2D with:
- **5-layer volumetric fire**: dark-red base → orange body → yellow core → white-hot tip → wind-bent plume
- **Ember particle system**: 200-max particles, wind-biased velocity, fade-out
- **Animated walls**: brick texture with heat-tint shift and crack lines near fire
- **Charred obstacles**: dark rubble cells with ember-glow when adjacent to fire
- **Fog-of-war**: per-cell alpha overlay; fire beacon glow punches through fog
- **Minecraft-style agent**: pixel-art character with health-based color theme (blue→orange→red→purple), gold health arc ring, and movement trail

The agent color changes with HP: `healthy (≥60%) → blue`, `moderate (30–59%) → orange`, `low (1–29%) → red`, `critical (≤0%) → purple`.

The right side panel polls `/scene` every 500 ms and shows tactical controls, per-door state (open/closed/failed), agent biometrics, environment stats, an event log with reward annotations, and raw network activity.

---

## Deployment

```bash
openenv push --repo-id your-org/pyre-env
```

The `openenv.yaml` manifest declares this as a FastAPI space on port 8000. Docker configuration is in `server/Dockerfile`.

---

## Roadmap

The current architecture — cellular automaton physics, composable rubrics, BFS-based visibility, narrative observation layer, dual LLM+RL interface — is designed to generalise. Planned extensions:

### Other natural disasters
`FireSim` is one implementation of a physics layer. The same environment shell is designed to support alternative calamity models with minimal changes:

| Disaster | Physics swap | New mechanic |
|---|---|---|
| **Flood** | Water pressure + rising level grid | Agent must find high ground or exits before water fills corridors |
| **Earthquake** | Probabilistic wall collapse | Rubble blocks form during episode; structural integrity per cell |
| **Chemical spill** | Wind-borne toxin concentration | Invisible hazard; agent must infer spread direction from health decay |
| **Wildfire (ground level)** | Existing fire sim, outdoor map | No walls, wind-dominated spread, sparse exits |

Each shares the same reward rubric composability, observation layer, and training stack.

### NPC characters
The floor plan templates already define `spawn_zones` and the state model has placeholders for multi-agent positions. Next steps:
- Add panicking civilians who move randomly and block corridors
- Rescue mechanic: escort NPCs to exits for bonus reward
- Theory-of-mind challenge: agent must model NPC movement to plan around them
- Competing agent: second RL agent racing for the same exit (mixed cooperative/competitive)

### 3D maps and multi-floor buildings
- Stack floor levels connected by staircases
- Fire spreads both horizontally and vertically through floor openings
- 3D BFS cone-of-vision observation (currently 2D flood-fill)
- Elevator shafts as high-risk shortcuts
- Procedural multi-floor generator extending the existing Prim-MST approach

### LLM fine-tuning (GRPO)
- `training/` already scaffolds GRPO infrastructure alongside PPO
- Fine-tune a language model's policy directly on Pyre episode rollouts
- Compare: PPO on structured grid vs GRPO on text narrative — does the LLM develop genuine spatial reasoning or pattern-match the narrative?

### Harder curriculum stages
- `extreme` difficulty: procedural map, 5 fire sources, humidity 0–5%, always hurricane-force wind, 75 max steps
- Dynamic difficulty adjustment: real-time difficulty scaling based on agent rolling success rate
- Adversarial fire placement: second agent controls fire source positions to maximise agent failure

---

## Hackathon alignment

- **Theme #2 — Long-Horizon Planning**: episodes run up to 100–200 steps depending on difficulty; agent must build a mental map across many partial observations with no global state
- **Theme #3.1 — World Modeling**: no global map; agent infers fire spread direction, corridor topology, and exit reachability from local first-person text observations alone