---
title: Pyre - Crisis Navigation Environment
emoji: 🔥
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---

# Pyre - Crisis Navigation Environment for LLM Agents

> *When buildings burn, the difference between a safe evacuation and a tragedy is the quality of decisions made in the first 60 seconds. Can we train an LLM to make them?*

**Pyre** places an LLM agent *inside* a burning building. The agent must navigate to safety under partial observability: no global map, a real health system, hard time pressure, and a fire that actively spreads, blocks exits, and permanently alters the floor plan.

**Links:** 🔥 [Live Space](https://krooz-pyre-env.hf.space) | 🤖 [Trained Model](https://huggingface.co/Krooz/pyre-ppo-agent) | 📓 [Colab Training](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing) | 📝 [Blog](BLOG.md)

---

## Why Pyre vs. existing environments

| Feature | `grid_world` | `maze_env` | `wildfire_env` | **Pyre** |
|---|---|---|---|---|
| Observability | Full | Full | Partial | **Partial, first-person, text** |
| Map dynamics | Static | Static | Dynamic (fire) | **Dynamic (fire + doors + burnout)** |
| Action richness | 4 moves | 4 moves | Suppression | **Movement + door control + look** |
| Agent role | Mover | Mover | Suppressor | **Survivor** |
| Reward complexity | Reach goal | Reach goal | Suppress fire | **14-component composite rubric** |

*`wildfire_env` trains an agent to fight fires from above; Pyre trains an agent to survive from inside.*

---

## Quick start

```bash
uv sync
uv run server  # → http://localhost:8000

# Health check
curl http://localhost:8000/health

# Start episode
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "medium"}'

# Take a step
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": "move", "direction": "north"}'

# Random baseline (smoke test)
python examples/random_agent.py --episodes 5 --verbose
```

### Python client

```python
from pyre_env import PyreEnv, PyreAction

with PyreEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.narrative)

    result = env.step(PyreAction(action="move", direction="north"))
    print(f"Reward: {result.reward:.3f} | HP: {result.observation.agent_health}")
```

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8000` | HTTP server port |
| `PYRE_MAX_STEPS` | `150` | Default max steps per episode (overridden by difficulty preset) |
| `PYRE_SEED` | `42` | Base RNG seed; each episode increments by 37 |
| `HF_TOKEN` | none | Required only for `training/push_to_hub.py` |

---

## Architecture

```
reset() / step()
      │
      ▼
PyreEnvironment            server/pyre_env_environment.py
  ├── floor_plan.py        Building template or procedural generation
  ├── fire_sim.py          Cellular automaton: spread → intensity → smoke
  ├── narrative.py         BFS visibility → first-person text + structured fields
  └── rubrics.py           14 composable reward components
      │
      ▼
PyreObservation            models.py
  ├── narrative            str - primary LLM input
  ├── map_state            PyreMapState - full grid snapshot for RL encoders
  ├── reward               float
  ├── done                 bool
  └── metadata             dict - fire params, distances, difficulty
```

### Data flow per step

1. `_execute_action()`: move / door / look / wait, returns feedback string
2. Check evacuation: agent on EXIT cell with `fire < 0.5` → success
3. `FireSim.step()`: advance fire, smoke, burn timers; may convert cells to OBSTACLE
4. Apply health damage from smoke (0.5–5 HP/step) and fire (10 HP/step)
5. `_compute_reward()`: call all 14 rubrics with shared kwargs
6. `build_narrative_observation()`: BFS visibility, compose text, collect action hints
7. `_build_map_state()`: assemble full grid snapshot for UI / RL encoder
8. Return `PyreObservation`

---

## Project structure

```
pyre_env/
├── models.py                    PyreAction, PyreObservation, PyreMapState, PyreState
├── client.py                    PyreEnv (EnvClient subclass, narrative-focused)
├── openenv.yaml                 OpenEnv manifest (space, fastapi, port 8000)
├── pyproject.toml
│
├── server/
│   ├── app.py                   FastAPI bootstrap; stateful /reset, /step, /state, /scene
│   ├── pyre_env_environment.py  PyreEnvironment state machine + difficulty presets
│   ├── floor_plan.py            3 hand-authored templates + procedural generator
│   ├── fire_sim.py              Cellular automaton fire/smoke simulation
│   ├── narrative.py             BFS visibility + first-person text renderer
│   └── rubrics.py               14 composable reward rubric classes
│
├── frontend/
│   ├── src/
│   │   ├── App.tsx                      Dashboard shell: topbar, canvas zone, side panel
│   │   ├── components/Map2D.tsx         Canvas2D renderer: fire, smoke, fog-of-war, agent
│   │   ├── components/HUD.tsx           HP bar, wind compass, step counter overlay
│   │   ├── components/ControlPanel.tsx  Move/door controls, difficulty, auto-wait
│   │   └── components/StatusCard.tsx    Agent biometrics, environment stats
│   └── README.md                Frontend setup and demo script
│
├── training/
│   ├── ppo/
│   │   ├── train_torch_ppo.py       PPO (in-process or `--server` for HTTP EnvClient)
│   │   ├── train_torch_ppo_http.py  Thin wrapper: forwards argv to `train_torch_ppo.py --server ...`
│   │   └── pyre_ppo_training.ipynb  Colab notebook (self-contained, talks to HF Space)
│   └── push_to_hub.py           Upload checkpoint + metrics to HuggingFace Hub
│
├── examples/
│   └── random_agent.py          Baseline: 70% hint-biased, 30% random
│
└── artifacts/                   Training outputs: .pt, .csv, .png
```

---

## Simulation layer

### Fire simulation (`server/fire_sim.py`)

A stochastic cellular automaton over a flat row-major grid.
Each call to `FireSim.step()` runs three phases:

**Phase 1: Ignition.** Any cell with `fire ≥ FIRE_BURNING (0.3)` tries to ignite each cardinal neighbor:

```
p_ignite = p_spread × (1 − humidity) × wind_multiplier × fuel_map[neighbor]
```

- **Wind multiplier**: dot product of spread direction with wind vector → downwind 2×, upwind 0.5×, crosswind 1×
- **Closed doors**: `DOOR_CLOSED_FIRE_FACTOR = 0.15` (fire crosses at 15% normal rate)
- **Fuel map**: per-cell float from `floor_plan.py`; office rooms 1.5×, exits 0.6×

**Phase 2: Intensity.** Existing fire gains `FIRE_INTENSITY_GAIN (0.15) × fuel_map[i]` per step. When `burn_timer ≥ BURNOUT_TICKS (5)` and intensity reaches 1.0, the cell becomes `OBSTACLE`: permanently impassable rubble.

**Phase 3: Smoke.** Smoke is sourced at +0.3/step for cells with `fire ≥ 0.3`, diffuses between neighbors at `SMOKE_SPREAD_RATE (0.20)`, passes through closed doors at 40% rate, and decays per cell according to `ventilation_map`.
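As a rough illustration of the Phase 1 formula, the sketch below computes `p_ignite` for one burning cell and one neighbor. It is not the code in `fire_sim.py`: the function names, the `(dx, dy)` direction tuples, the treatment of calm wind as a zero vector, the multiplicative use of `DOOR_CLOSED_FIRE_FACTOR`, and the final clamp to `[0, 1]` are all assumptions made for the example.

```python
DOOR_CLOSED_FIRE_FACTOR = 0.15  # value taken from the README


def wind_multiplier(spread_dir, wind_dir):
    """Downwind 2x, upwind 0.5x, crosswind (or calm) 1x.

    Both arguments are hypothetical (dx, dy) unit steps; calm wind is (0, 0).
    """
    dot = spread_dir[0] * wind_dir[0] + spread_dir[1] * wind_dir[1]
    if dot > 0:
        return 2.0
    if dot < 0:
        return 0.5
    return 1.0


def ignition_probability(p_spread, humidity, spread_dir, wind_dir,
                         fuel, through_closed_door=False):
    """p_ignite = p_spread * (1 - humidity) * wind_multiplier * fuel."""
    p = p_spread * (1.0 - humidity) * wind_multiplier(spread_dir, wind_dir) * fuel
    if through_closed_door:
        # Assumption: the door factor simply scales the probability.
        p *= DOOR_CLOSED_FIRE_FACTOR
    # Assumption: clamp so the result is always a valid probability.
    return max(0.0, min(1.0, p))


# Fire spreading east into an office cell (fuel 1.5) with an east wind.
downwind = ignition_probability(0.3, 0.2, (1, 0), (1, 0), fuel=1.5)
# Same spread rate, but spreading against the wind into a corridor cell.
upwind = ignition_probability(0.3, 0.2, (-1, 0), (1, 0), fuel=1.0)
print(f"downwind={downwind:.3f} upwind={upwind:.3f}")
```

With these inputs the downwind office cell is six times more likely to ignite than the upwind corridor cell, which is why wind direction and room type dominate route choice.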

**Key constants:**

| Constant | Value | Role |
|---|---|---|
| `FIRE_IGNITION` | 0.1 | Starting intensity for new ignitions |
| `FIRE_BURNING` | 0.3 | Threshold for spreading and causing damage |
| `FIRE_INTENSITY_GAIN` | 0.15 | Intensity added per step to burning cell |
| `BURNOUT_TICKS` | 5 | Steps at full intensity before cell → OBSTACLE |
| `DOOR_CLOSED_FIRE_FACTOR` | 0.15 | Fire spread multiplier through closed doors |
| `SMOKE_SPREAD_RATE` | 0.20 | Smoke diffusion rate between neighbors |
| `SMOKE_DOOR_FACTOR` | 0.40 | Smoke rate through closed doors |
| `EXIT_BLOCKED_FIRE_THRESHOLD` | 0.5 | Fire intensity at which an exit is considered blocked |

### Building templates (`server/floor_plan.py`)

Three hand-authored 16×16 templates for easy and medium difficulty:

| Template | Layout | Exits | Doors | Notes |
|---|---|---|---|---|
| `small_office` | Two corridor bands + office rooms N/S | 2 (W, E walls) | 8 (room↔corridor) | Agent spawns in corridor |
| `open_plan` | Open hall with 4 × 2×2 pillar obstacles | 2 (diagonal corners) | 0 | High ventilation throughout |
| `t_corridor` | T-shaped: vertical stem + horizontal bar | 3 (top, left, right) | 4 (rooms off stem) | Multiple route decisions |

Each template carries a `zone_map` (cell → zone label), derived `fuel_map`, and `ventilation_map`:

| Zone | Fuel multiplier | Smoke decay/step | Notes |
|---|---|---|---|
| `north/south_offices` | 1.5× | 0.010 | High fuel, poor ventilation |
| `west/east_rooms` | 1.5× | 0.010 | Same as offices |
| `main_corridor` | 1.0× | 0.028 | Baseline |
| `northwest/northeast/etc. hall` | 0.9× | 0.050 | Open plan; best ventilation |
| `exit` | 0.6× | 0.040 | Concrete, vented |

**Hard mode: procedural generation.** Episodes run on a freshly generated 20×24 floor plan every time:

1. **Room placement**: random non-overlapping rectangles (3–5 × 3–4 cells, 6–10 rooms)
2. **MST corridors**: Prim-style minimum spanning tree connecting room centers via L-shaped tunnels
3. **Exit placement**: deterministic tunnels from leftmost/rightmost floor cells to outer walls
4. **Connectivity guard**: BFS from agent spawn verifies ≥1 exit is reachable; up to 3 attempts; falls back to `small_office`

### Visibility (`server/narrative.py`)

BFS flood-fill from agent position, walls block expansion:

| Agent smoke level | Visibility radius |
|---|---|
| None / light (`< 0.5`) | 5 cells |
| Moderate (`0.5–0.8`) | 3 cells |
| Heavy (`≥ 0.8`) | 2 cells |

---

## What the agent sees

Every step, `narrative.py` assembles a first-person text observation from raw grid state:

```
You are in the **main_corridor**. The air is **moderate**.
Health: ████████░░ (85/100) | Wind: **EAST**
Flames are visible to the **west**.
Exits visible: exit_0_7 at 8m west.
Doors: door_1 (closed) at 2m east.
You hear: Fire alarm sounding; Smoke detector beeping.
Last action: You move south. The smoke is thick here.

Available actions:
  move(direction='north')
  move(direction='south')
  door(target_id='door_1', door_state='open')
  look(direction='east')
  wait()
```

The same state is also exposed as structured fields in `PyreObservation` (smoke level, fire direction, visible objects, blocked exits, action hints) and as a full grid snapshot in `PyreObservation.map_state` for programmatic / RL use.

---

## Action space

| Action | Parameters | Effect |
|---|---|---|
| `move` | `direction: north\|south\|east\|west` | Move one cell; blocked by walls, obstacles, closed doors |
| `door` | `target_id: str`, `door_state: open\|close` | Open or close a door within 2 cells Manhattan distance |
| `look` | `direction: north\|south\|east\|west` | Ray-scan up to 5 cells; returns per-cell smoke/fire/zone/door/exit detail. Time still advances. |
| `wait` | none | Skip turn |

---

## Reward function: all 14 components

### Per-step rubrics

| Class | Value | Condition |
|---|---|---|
| `TimeStepPenalty` | −0.01 | Every step |
| `ProgressReward` | +0.25 | `move` reduced BFS distance to nearest unblocked exit |
| `ProgressRegressionPenalty` | −0.15 | `move` increased BFS distance to nearest exit |
| `SafeProgressBonus` | +0.05 | Progress AND new cell has `smoke < 0.5` |
| `DangerPenalty` | −0.50 | `move` into cell with `smoke ≥ 0.5` OR adjacent to `fire ≥ 0.3` |
| `HealthDrainPenalty` | −0.02 × dmg | Proportional to HP lost this step |
| `StrategicDoorBonus` | +0.50 | Closed a door with a cardinal neighbor `fire ≥ 0.3`; once per door per episode |
| `ExplorationBonus` | +0.02 | `move` to a cell not visited this episode |

### Episode-end rubrics (fire only when `done=True`)

| Class | Value | Condition |
|---|---|---|
| `SelfSurviveBonus` | +5.0 | Agent evacuated alive |
| `HealthSurvivalBonus` | +1.5 × (hp/100) | Agent evacuated (range 0 → +1.5) |
| `SelfDeathPenalty` | −10.0 | Agent died (HP ≤ 0) |
| `TimeoutPenalty` | −5.0 to −8.0 | Alive but out of steps; scaled by `−5 − 3×(hp/100)` when exits were reachable |
| `NearMissBonus` | max(0, 3.0 − 0.5 × min_BFS_dist) | On death only; `min_BFS_dist` = closest BFS distance to any exit reached this episode |
| `TimeBonus` | +0.05 × remaining_steps | Agent evacuated |

**BFS note:** `ProgressReward`, `ProgressRegressionPenalty`, `SafeProgressBonus`, and `NearMissBonus` all use true BFS traversal distance (walls and obstacles block; closed doors are treated as passable so the reward models optimal reachability assuming doors can be opened). The PPO trainer's exit "pull" (below) uses **Manhattan** distance to listed exit cells only for an extra shaping signal; it is not part of the environment rubric.
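To make the BFS note concrete, here is a minimal sketch of a traversal distance in which walls and obstacles block but closed doors count as passable. The character grid, the cell symbols, and the function names are hypothetical and not taken from `rubrics.py`; the sketch also ignores whether an exit is currently blocked by fire.

```python
from collections import deque

# Hypothetical cell symbols for this example only.
WALL, OBSTACLE, EXIT = "#", "X", "E"


def bfs_exit_distance(grid, start):
    """Return the number of cardinal steps from start to the nearest exit.

    grid is a list of equal-length strings and start is (row, col).
    Walls and obstacles block; every other cell, including a closed
    door ("D"), is passable. Returns None when no exit is reachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == EXIT:
            return dist
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if nxt in seen or grid[nxt[0]][nxt[1]] in (WALL, OBSTACLE):
                continue
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


def near_miss_bonus(min_bfs_dist):
    """NearMissBonus from the table: max(0, 3.0 - 0.5 * min_BFS_dist)."""
    return max(0.0, 3.0 - 0.5 * min_bfs_dist)


GRID = [
    "#####",
    "#.#E#",
    "#.#.#",
    "#.D.#",  # "D" is a closed door: passable for reward purposes
    "#####",
]
print(bfs_exit_distance(GRID, (1, 1)))  # walking distance around the wall
print(near_miss_bonus(2))
```

In this grid the exit is two cells away by Manhattan distance but six steps away on foot, which is the gap between the rubric's distance and the trainer's shaping signal.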

### PPO training script only (`training/ppo/train_torch_ppo.py`)

The server's step reward above is **further adjusted** inside the training loop (not returned to HTTP clients):

- **−0.05** on `wait`
- **−0.15** after a `move` if any cardinal neighbor has fire **> 0.15**
- **−0.20** if the new position was already in the last **12** positions this episode
- **+ max(0, 0.25 − 0.04 × d)** on `move` when not yet evacuated, where **d** is the **Manhattan** distance to the nearest cell in `map_state.exit_positions`

---

## Difficulty presets

| Level | Sources | Spread rate | Humidity | Wind | Max steps | Map |
|---|---|---|---|---|---|---|
| `easy` | 1 | 10–20% | 30–50% | CALM only | 200 | Fixed 16×16 templates |
| `medium` | 2–4 | 15–40% | 10–45% | Any | 150 | Fixed 16×16 templates |
| `hard` | 3–5 | 30–55% | 5–20% | Never CALM | 100 | Procedural 20×24 |

**Health damage rates** (applied after fire sim step):

| Condition | HP/step |
|---|---|
| Light smoke (`0.2–0.5`) | 0.5 |
| Moderate smoke (`0.5–0.8`) | 2.0 |
| Heavy smoke (`≥ 0.8`) | 5.0 |
| On fire (`fire ≥ 0.3`) | 10.0 |

Smoke and fire damage stack if both conditions apply.

---

## HTTP API

The FastAPI server exposes both the standard OpenEnv routes and additional endpoints:

| Method | Path | Body | Returns |
|---|---|---|---|
| `GET` | `/health` | none | `{"status": "ok"}` |
| `POST` | `/reset` | `{"difficulty": "medium", "seed": null}` | `{observation, reward, done, metadata}` |
| `POST` | `/step` | `{"action": "move", "direction": "north"}` | `{observation, reward, done, metadata}` |
| `GET` | `/state` | none | Full `PyreState` dump |
| `GET` | `/scene` | none | Structured scene graph for UI renderers |
| `GET` | `/` | none | Frontend `index.html` |

`/scene` returns a 5-channel per-cell tensor (`cell_type`, `fire`, `smoke`, `is_agent`, `is_visible`) plus structured `labels` (agent position/health/location, episode params, door registry), which the React frontend consumes.
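
For callers that do not want the `pyre_env` client, the routes in the table can be reached with the Python standard library alone. The sketch below is a minimal example under assumptions: `build_request` and `call` are hypothetical helper names, the base URL is the local default from Quick start, and the response is assumed to be a JSON object with the keys listed in the table.

```python
import json
from urllib.request import Request, urlopen


def build_request(base_url, path, body=None):
    """Build a GET request, or a JSON POST request when a body is given."""
    url = base_url.rstrip("/") + path
    if body is None:
        return Request(url, method="GET")
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(base_url, path, body=None, timeout=10):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urlopen(build_request(base_url, path, body), timeout=timeout) as resp:
        return json.load(resp)


# Requests matching the table rows; nothing is sent until call() is used.
reset_req = build_request("http://localhost:8000", "/reset",
                          {"difficulty": "medium", "seed": None})
step_req = build_request("http://localhost:8000", "/step",
                         {"action": "move", "direction": "north"})
print(reset_req.get_method(), reset_req.full_url)
print(step_req.get_method(), step_req.full_url)
```

With the server running, `call("http://localhost:8000", "/reset", {"difficulty": "medium"})` followed by repeated `/step` calls would drive one episode; because the server is stateful, the calls must go to the same instance.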

---

## Training

Three training surfaces share the same PPO algorithm core from `train_torch_ppo.py`:

### 1. In-process (fastest)

```bash
python training/ppo/train_torch_ppo.py \
  --episodes 500 \
  --device cuda \
  --difficulty-schedule easy,medium,hard \
  --patience-threshold 0.65 \
  --output artifacts/pyre_ppo.pt
```

### 2. HTTP (against live server)

`train_torch_ppo_http.py` is a thin wrapper; it runs `train_torch_ppo.py` with `--server http://localhost:8000` and passes through any other flags.

```bash
# Start server first
uv run server

# Equivalent: python training/ppo/train_torch_ppo.py --server http://localhost:8000 --episodes 300
python training/ppo/train_torch_ppo_http.py --episodes 300
```

### 3. Colab notebook (against HF Space)

Open [`training/ppo/pyre_ppo_training.ipynb`](training/ppo/pyre_ppo_training.ipynb) or the [hosted Colab](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing). The notebook points `SERVER_URL` at `https://krooz-pyre-env.hf.space` and trains entirely over HTTP.

### Observation encoding

The `ObservationEncoder` in `train_torch_ppo.py` encodes each `PyreObservation` into a **5,790-dim float32 vector** (`ObservationEncoder.base_dim`):

```
Grid:  24×24 × 10 channels = 5,760
  • 6 one-hot cell type  (floor/wall/door_open/door_closed/exit/obstacle)
  • fire intensity       [0,1]
  • smoke density        [0,1]
  • visibility mask      (1=visible)
  • agent position mask

Scalars: 17 global features
  health, step_progress, fire_spread_rate, humidity,
  agent_x_norm, agent_y_norm, nearest_exit_distance,
  reachable_exit_count, visible_cell_count, fire_sources,
  smoke_severity, alive, evacuated,
  exit_dx_norm, exit_dy_norm, exit_manhattan_norm  ← exit compass (map-agnostic)

One-hots: wind (5) + difficulty (4: easy, medium, hard_fixed, hard) + route hint (4: N/S/W/E) = 13

Total: 5,760 + 17 + 13 = 5,790
```

With `--history-length 4` (default), four frames are stacked: **input_dim = 23,160**.
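
The dimension arithmetic above can be checked with a few lines of Python. This is a sketch, not the `ObservationEncoder` itself: the constant names and the `FrameStack` class are hypothetical, and filling the history with copies of the first frame at episode start is an assumption about how stacking might be initialised.

```python
from collections import deque

# Sizes quoted in the README.
GRID_SIDE = 24
GRID_CHANNELS = 6 + 4          # 6 one-hot cell types + fire, smoke, visibility, agent
SCALAR_FEATURES = 17
ONE_HOT_FEATURES = 5 + 4 + 4   # wind + difficulty + route hint


def base_dim():
    """Length of one encoded frame."""
    grid = GRID_SIDE * GRID_SIDE * GRID_CHANNELS
    return grid + SCALAR_FEATURES + ONE_HOT_FEATURES


def input_dim(history_length=4):
    """Length of the network input after stacking frames."""
    return base_dim() * history_length


class FrameStack:
    """Keep the last N frames and expose them as one flat vector."""

    def __init__(self, history_length=4):
        self.history_length = history_length
        self.frames = deque(maxlen=history_length)

    def reset(self, first_frame):
        # Assumption: pad the history with copies of the first frame.
        self.frames.clear()
        for _ in range(self.history_length):
            self.frames.append(list(first_frame))

    def push(self, frame):
        # The deque drops the oldest frame automatically.
        self.frames.append(list(frame))

    def vector(self):
        return [value for frame in self.frames for value in frame]


stack = FrameStack()
stack.reset([0.0] * base_dim())
stack.push([1.0] * base_dim())
print(base_dim(), input_dim(), len(stack.vector()))
```

Because the 24×24 grid is larger than every map in the presets (16×16 and 20×24), one fixed encoder size can serve all difficulty levels.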

### Network architecture

```
Input (23,160)
  → LayerNorm
  → FC(512) → LayerNorm → ReLU
  → FC(256) → LayerNorm → ReLU
  → FC(128) → ReLU
  ├── Policy head → FC(37) logits + action mask (−∞ for invalid)
  └── Value head  → FC(1) scalar
```

The policy uses **37** discrete actions (4 move, 1 wait, 16 door-open, 16 door-close); `look` is not in the PPO head because the map encoder already carries visibility. Orthogonal init (√2 gain hidden layers, 0.01 policy head). Total parameters: **~12.1M** (varies slightly with `hidden-sizes`).

### Default PPO / curriculum flags

- `--difficulty-schedule` default: `easy,medium,hard_fixed,hard` (full three-stage path including two hard modes).
- Patience (defaults): threshold **0.65**, window **15** episodes, optional `--hard-mix-ratio` / `--hard-mix-dist` during the hard stage.

### Curriculum

The `PatienceCurriculum` gating:

```
Stay on current stage until:
  success_rate (last 30 eps) ≥ patience_threshold (default 0.65)
  for patience_window (default 15) consecutive episodes
→ then advance to the next stage in --difficulty-schedule

Final stage: optional replay of the previous stage (--hard-mix-ratio, default 0.25)
             or a custom distribution (--hard-mix-dist), to limit forgetting
```

With the default schedule `easy,medium,hard_fixed,hard`, the "replay" stage during **hard** is typically **`hard_fixed`**, not `medium`. The **`pyre_ppo_hard_v2`** run used an explicit `hard:…,medium:…,easy:…` mix instead; see `training/push_to_hub.py` for the exact flags.

### Push to HuggingFace Hub

```bash
export HF_TOKEN=hf_...
uv run python training/push_to_hub.py \
  --repo-id Krooz/pyre-ppo-agent \
  --stem pyre_ppo_hard_v2 \
  --artifacts-dir artifacts
```

Uploads `{stem}.pt`, `{stem}.csv`, `{stem}.png`, `{stem}_eval.csv`, and generates a model card README. The script's embedded summary targets the **`pyre_ppo_hard_v2`** HTTP run; adjust `--stem` if you use a different checkpoint prefix.

Trained weights: **[Krooz/pyre-ppo-agent](https://huggingface.co/Krooz/pyre-ppo-agent)**.

### Training results

![Pyre PPO: HTTP run `pyre_ppo_hard_v2`, 600 episodes, easy → medium → hard](artifacts/pyre_ppo_hard_v2.png)

**Primary run on record (`artifacts/pyre_ppo_hard_v2.*`): 600 HTTP episodes** with a patience-gated **easy → medium → hard** schedule, eval every 25 episodes on **hard** (see `training/push_to_hub.py` for the exact CLI, metrics table, and hub model-card text). Representative headline numbers from that run: **~55%** final training success rate (MA-20, graph title), **~52.7%** overall evacuation over all 600 episodes, and **~10.5%** evacuation on **hard** episodes within the run, showing the agent still struggles on fully procedural hard maps while improving on easy/medium.

**Earlier ablation (200 episodes, easy → medium only):** a previous curve reached **~75%** success on **medium** after 200 episodes (no `hard` in the training mix). That artifact set is no longer in the tree; the figure and CSV above supersede it for the hackathon write-up.

---

## Frontend

A cinematic real-time visualization built in React 19 + Vite + TypeScript.

```bash
cd frontend
npm install
npm run dev  # → http://localhost:5173
```

The map renderer (`src/components/Map2D.tsx`) uses HTML5 Canvas 2D with:

- **5-layer volumetric fire**: dark-red base → orange body → yellow core → white-hot tip → wind-bent plume
- **Ember particle system**: 200-max particles, wind-biased velocity, fade-out
- **Animated walls**: brick texture with heat-tint shift and crack lines near fire
- **Charred obstacles**: dark rubble cells with ember-glow when adjacent to fire
- **Fog-of-war**: per-cell alpha overlay; fire beacon glow punches through fog
- **Minecraft-style agent**: pixel-art character with health-based color theme (blue→orange→red→purple), gold health arc ring, and movement trail

The agent color changes with HP: `healthy (≥60%) → blue`, `moderate (30–59%) → orange`, `low (1–29%) → red`, `critical (≤0%) → purple`.

The right side panel polls `/scene` every 500ms and shows tactical controls, per-door state (open/closed/failed), agent biometrics, environment stats, event log with reward annotations, and raw network activity.

---

## Deployment

```bash
openenv push --repo-id your-org/pyre-env
```

The `openenv.yaml` manifest declares this as a FastAPI space on port 8000. Docker configuration is in `server/Dockerfile`.

---

## Roadmap

The current architecture (cellular automaton physics, composable rubrics, BFS-based visibility, narrative observation layer, dual LLM+RL interface) is designed to generalise. Planned extensions:

### Other natural disasters

The `FireSim` is one implementation of a physics layer.
The same environment shell supports alternative calamity models with minimal changes:

| Disaster | Physics swap | New mechanic |
|---|---|---|
| **Flood** | Water pressure + rising level grid | Agent must find high ground or exits before water fills corridors |
| **Earthquake** | Probabilistic wall collapse | Rubble blocks form during episode; structural integrity per cell |
| **Chemical spill** | Wind-borne toxin concentration | Invisible hazard; agent must infer spread direction from health decay |
| **Wildfire (ground level)** | Existing fire sim, outdoor map | No walls, wind-dominated spread, sparse exits |

Each shares the same reward rubric composability, observation layer, and training stack.

### NPC characters

The floor plan templates already define `spawn_zones` and the state model has placeholders for multi-agent positions. Next steps:

- Add panicking civilians who move randomly and block corridors
- Rescue mechanic: escort NPCs to exits for bonus reward
- Theory-of-mind challenge: agent must model NPC movement to plan around them
- Competing agent: second RL agent racing for the same exit (mixed cooperative/competitive)

### 3D maps and multi-floor buildings

- Stack floor levels connected by staircases
- Fire spreads both horizontally and vertically through floor openings
- 3D BFS cone-of-vision observation (currently 2D flood-fill)
- Elevator shafts as high-risk shortcuts
- Procedural multi-floor generator extending the existing Prim-MST approach

### LLM fine-tuning (GRPO)

- `training/` already scaffolds GRPO infrastructure alongside PPO
- Fine-tune a language model's policy directly on Pyre episode rollouts
- Compare: PPO on structured grid vs GRPO on text narrative. Does the LLM develop genuine spatial reasoning or pattern-match the narrative?

### Harder curriculum stages

- `extreme` difficulty: procedural map, 5 fire sources, humidity 0–5%, always hurricane-force wind, 75 max steps
- Dynamic difficulty adjustment: real-time difficulty scaling based on agent rolling success rate
- Adversarial fire placement: second agent controls fire source positions to maximise agent failure

---

## Hackathon alignment

- **Theme #2: Long-Horizon Planning**: 50–200 step episodes; agent must build a mental map across many partial observations with no global state
- **Theme #3.1: World Modeling**: no global map; agent infers fire spread direction, corridor topology, and exit reachability from local first-person text observations alone