| --- |
| title: Pyre — Crisis Navigation Environment |
| emoji: 🔥 |
| colorFrom: red |
| colorTo: yellow |
| sdk: docker |
| pinned: false |
| app_port: 8000 |
| tags: |
| - openenv |
| --- |
| |
| # Pyre — Crisis Navigation Environment for LLM Agents |
|
|
| > *When buildings burn, the difference between a safe evacuation and a tragedy is the quality of decisions made in the first 60 seconds. Can we train an LLM to make them?* |
|
|
| **Pyre** places an LLM agent *inside* a burning building. The agent must navigate to safety under partial observability — no global map, a real health system, hard time pressure, and a fire that actively spreads, blocks exits, and permanently alters the floor plan. |
|
|
| **Links:** |
| 🔥 [Live Space](https://krooz-pyre-env.hf.space) | |
| 🤖 [Trained Model](https://huggingface.co/Krooz/pyre-ppo-agent) | |
| 📓 [Colab Training](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing) | |
| 📝 [Blog](BLOG.md) |
|
|
| --- |
|
|
| ## Why Pyre vs. existing environments |
|
|
| | Feature | `grid_world` | `maze_env` | `wildfire_env` | **Pyre** | |
| |---|---|---|---|---| |
| | Observability | Full | Full | Partial | **Partial, first-person, text** | |
| | Map dynamics | Static | Static | Dynamic (fire) | **Dynamic (fire + doors + burnout)** | |
| | Action richness | 4 moves | 4 moves | Suppression | **Movement + door control + look** | |
| | Agent role | Mover | Mover | Suppressor | **Survivor** | |
| | Reward complexity | Reach goal | Reach goal | Suppress fire | **14-component composite rubric** | |
|
|
| *`wildfire_env` trains an agent to fight fires from above; Pyre trains an agent to survive from inside.* |
|
|
| --- |
|
|
| ## Quick start |
|
|
| ```bash |
| uv sync |
| uv run server # → http://localhost:8000 |
| |
| # Health check |
| curl http://localhost:8000/health |
| |
| # Start episode |
| curl -X POST http://localhost:8000/reset \ |
| -H "Content-Type: application/json" \ |
| -d '{"difficulty": "medium"}' |
| |
| # Take a step |
| curl -X POST http://localhost:8000/step \ |
| -H "Content-Type: application/json" \ |
| -d '{"action": "move", "direction": "north"}' |
| |
| # Random baseline (smoke test) |
| python examples/random_agent.py --episodes 5 --verbose |
| ``` |
|
|
| ### Python client |
|
|
| ```python |
| from pyre_env import PyreEnv, PyreAction |
| |
| with PyreEnv(base_url="http://localhost:8000") as env: |
| result = env.reset() |
| print(result.observation.narrative) |
| result = env.step(PyreAction(action="move", direction="north")) |
| print(f"Reward: {result.reward:.3f} | HP: {result.observation.agent_health}") |
| ``` |
|
|
| ### Environment variables |
|
|
| | Variable | Default | Description | |
| |---|---|---| |
| | `PORT` | `8000` | HTTP server port | |
| | `PYRE_MAX_STEPS` | `150` | Default max steps per episode (overridden by difficulty preset) | |
| | `PYRE_SEED` | `42` | Base RNG seed; each episode increments by 37 | |
| | `HF_TOKEN` | — | Required only for `training/push_to_hub.py` | |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ``` |
| reset() / step() |
| │ |
| ▼ |
| PyreEnvironment server/pyre_env_environment.py |
| ├── floor_plan.py Building template or procedural generation |
| ├── fire_sim.py Cellular automaton: spread → intensity → smoke |
| ├── narrative.py BFS visibility → first-person text + structured fields |
| └── rubrics.py 14 composable reward components |
| │ |
| ▼ |
| PyreObservation models.py |
| ├── narrative str — primary LLM input |
| ├── map_state PyreMapState — full grid snapshot for RL encoders |
| ├── reward float |
| ├── done bool |
| └── metadata dict — fire params, distances, difficulty |
| ``` |
|
|
| ### Data flow per step |
|
|
| 1. `_execute_action()` — move / door / look / wait, returns feedback string |
| 2. Check evacuation — agent on EXIT cell with `fire < 0.5` → success |
| 3. `FireSim.step()` — advance fire, smoke, burn timers; may convert cells to OBSTACLE |
| 4. Apply health damage from smoke (0.5–5 HP/step) and fire (10 HP/step) |
| 5. `_compute_reward()` — call all 14 rubrics with shared kwargs |
| 6. `build_narrative_observation()` — BFS visibility, compose text, collect action hints |
| 7. `_build_map_state()` — assemble full grid snapshot for UI / RL encoder |
| 8. Return `PyreObservation` |
|
|
| --- |
|
|
| ## Project structure |
|
|
| ``` |
| pyre_env/ |
| ├── models.py PyreAction, PyreObservation, PyreMapState, PyreState |
| ├── client.py PyreEnv (EnvClient subclass, narrative-focused) |
| ├── openenv.yaml OpenEnv manifest (space, fastapi, port 8000) |
| ├── pyproject.toml |
| │ |
| ├── server/ |
| │ ├── app.py FastAPI bootstrap; stateful /reset, /step, /state, /scene |
| │ ├── pyre_env_environment.py PyreEnvironment state machine + difficulty presets |
| │ ├── floor_plan.py 3 hand-authored templates + procedural generator |
| │ ├── fire_sim.py Cellular automaton fire/smoke simulation |
| │ ├── narrative.py BFS visibility + first-person text renderer |
| │ └── rubrics.py 14 composable reward rubric classes |
| │ |
| ├── frontend/ |
| │ ├── src/ |
| │ │ ├── App.tsx Dashboard shell: topbar, canvas zone, side panel |
| │ │ ├── components/Map2D.tsx Canvas2D renderer: fire, smoke, fog-of-war, agent |
| │ │ ├── components/HUD.tsx HP bar, wind compass, step counter overlay |
| │ │ ├── components/ControlPanel.tsx Move/door controls, difficulty, auto-wait |
| │ │ └── components/StatusCard.tsx Agent biometrics, environment stats |
| │ └── README.md Frontend setup and demo script |
| │ |
| ├── training/ |
| │ ├── ppo/ |
| │ │ ├── train_torch_ppo.py PPO (in-process or `--server` for HTTP EnvClient) |
| │ │ ├── train_torch_ppo_http.py Thin wrapper: forwards argv to `train_torch_ppo.py --server ...` |
| │ │ └── pyre_ppo_training.ipynb Colab notebook (self-contained, talks to HF Space) |
| │ └── push_to_hub.py Upload checkpoint + metrics to HuggingFace Hub |
| │ |
| ├── examples/ |
| │ └── random_agent.py Baseline: 70% hint-biased, 30% random |
| │ |
| └── artifacts/ Training outputs: .pt, .csv, .png |
| ``` |
|
|
| --- |
|
|
| ## Simulation layer |
|
|
| ### Fire simulation (`server/fire_sim.py`) |
| |
| A stochastic cellular automaton over a flat row-major grid. Each call to `FireSim.step()` runs three phases: |
| |
| **Phase 1 — Ignition.** Any cell with `fire ≥ FIRE_BURNING (0.3)` tries to ignite each cardinal neighbor: |
|
|
| ``` |
| p_ignite = p_spread × (1 − humidity) × wind_multiplier × fuel_map[neighbor] |
| ``` |
|
|
| - **Wind multiplier**: dot product of spread direction with wind vector → downwind 2×, upwind 0.5×, crosswind 1× |
| - **Closed doors**: `DOOR_CLOSED_FIRE_FACTOR = 0.15` (fire crosses at 15% normal rate) |
| - **Fuel map**: per-cell float from `floor_plan.py`; office rooms 1.5×, exits 0.6× |
|
|
| **Phase 2 — Intensity.** Existing fire gains `FIRE_INTENSITY_GAIN (0.15) × fuel_map[i]` per step. When `burn_timer ≥ BURNOUT_TICKS (5)` and intensity reaches 1.0, the cell becomes `OBSTACLE` — permanently impassable rubble. |
|
|
| **Phase 3 — Smoke.** Smoke is sourced at +0.3/step for cells with `fire ≥ 0.3`, diffuses between neighbors at `SMOKE_SPREAD_RATE (0.20)`, passes through closed doors at 40% rate, and decays per cell according to `ventilation_map`. |
|
|
| **Key constants:** |
|
|
| | Constant | Value | Role | |
| |---|---|---| |
| | `FIRE_IGNITION` | 0.1 | Starting intensity for new ignitions | |
| | `FIRE_BURNING` | 0.3 | Threshold for spreading and causing damage | |
| | `FIRE_INTENSITY_GAIN` | 0.15 | Intensity added per step to burning cell | |
| | `BURNOUT_TICKS` | 5 | Steps at full intensity before cell → OBSTACLE | |
| | `DOOR_CLOSED_FIRE_FACTOR` | 0.15 | Fire spread multiplier through closed doors | |
| | `SMOKE_SPREAD_RATE` | 0.20 | Smoke diffusion rate between neighbors | |
| | `SMOKE_DOOR_FACTOR` | 0.40 | Smoke rate through closed doors | |
| | `EXIT_BLOCKED_FIRE_THRESHOLD` | 0.5 | Fire intensity at which an exit is considered blocked | |
|
|
| ### Building templates (`server/floor_plan.py`) |
| |
| Three hand-authored 16×16 templates for easy and medium difficulty: |
| |
| | Template | Layout | Exits | Doors | Notes | |
| |---|---|---|---|---| |
| | `small_office` | Two corridor bands + office rooms N/S | 2 (W, E walls) | 8 (room↔corridor) | Agent spawns in corridor | |
| | `open_plan` | Open hall with 4 × 2×2 pillar obstacles | 2 (diagonal corners) | 0 | High ventilation throughout | |
| | `t_corridor` | T-shaped: vertical stem + horizontal bar | 3 (top, left, right) | 4 (rooms off stem) | Multiple route decisions | |
|
|
| Each template carries a `zone_map` (cell → zone label), derived `fuel_map`, and `ventilation_map`: |
|
|
| | Zone | Fuel multiplier | Smoke decay/step | Notes | |
| |---|---|---|---| |
| | `north/south_offices` | 1.5× | 0.010 | High fuel, poor ventilation | |
| | `west/east_rooms` | 1.5× | 0.010 | Same as offices | |
| | `main_corridor` | 1.0× | 0.028 | Baseline | |
| | `northwest/northeast/etc. hall` | 0.9× | 0.050 | Open plan — best ventilation | |
| | `exit` | 0.6× | 0.040 | Concrete, vented | |
|
|
| **Hard mode — procedural generation.** Episodes run on a freshly generated 20×24 floor plan every time: |
|
|
| 1. **Room placement**: random non-overlapping rectangles (3–5 × 3–4 cells, 6–10 rooms) |
| 2. **MST corridors**: Prim-style minimum spanning tree connecting room centers via L-shaped tunnels |
| 3. **Exit placement**: deterministic tunnels from leftmost/rightmost floor cells to outer walls |
| 4. **Connectivity guard**: BFS from agent spawn verifies ≥1 exit is reachable; up to 3 attempts; falls back to `small_office` |
|
|
| ### Visibility (`server/narrative.py`) |
|
|
| BFS flood-fill from agent position, walls block expansion: |
|
|
| | Agent smoke level | Visibility radius | |
| |---|---| |
| | None / light (`< 0.5`) | 5 cells | |
| | Moderate (`0.5–0.8`) | 3 cells | |
| | Heavy (`≥ 0.8`) | 2 cells | |
|
|
| --- |
|
|
| ## What the agent sees |
|
|
| Every step, `narrative.py` assembles a first-person text observation from raw grid state: |
|
|
| ``` |
| You are in the **main_corridor**. The air is **moderate**. |
| Health: ████████░░ (85/100) | Wind: **EAST** |
| Flames are visible to the **west**. |
| Exits visible: exit_0_7 at 8m west. |
| Doors: door_1 (closed) at 2m east. |
| You hear: Fire alarm sounding; Smoke detector beeping. |
| Last action: You move south. The smoke is thick here. |
| Available actions: move(direction='north') move(direction='south') |
| door(target_id='door_1', door_state='open') look(direction='east') wait() |
| ``` |
|
|
| The same state is also exposed as structured fields in `PyreObservation` (smoke level, fire direction, visible objects, blocked exits, action hints) and as a full grid snapshot in `PyreObservation.map_state` for programmatic / RL use. |
|
|
| --- |
|
|
| ## Action space |
|
|
| | Action | Parameters | Effect | |
| |---|---|---| |
| | `move` | `direction: north\|south\|east\|west` | Move one cell; blocked by walls, obstacles, closed doors | |
| | `door` | `target_id: str`, `door_state: open\|close` | Open or close a door within 2 cells Manhattan distance | |
| | `look` | `direction: north\|south\|east\|west` | Ray-scan up to 5 cells; returns per-cell smoke/fire/zone/door/exit detail. Time still advances. | |
| | `wait` | — | Skip turn | |
|
|
| --- |
|
|
| ## Reward function — all 14 components |
|
|
| ### Per-step rubrics |
|
|
| | Class | Value | Condition | |
| |---|---|---| |
| | `TimeStepPenalty` | −0.01 | Every step | |
| | `ProgressReward` | +0.25 | `move` reduced BFS distance to nearest unblocked exit | |
| | `ProgressRegressionPenalty` | −0.15 | `move` increased BFS distance to nearest exit | |
| | `SafeProgressBonus` | +0.05 | Progress AND new cell has `smoke < 0.5` | |
| | `DangerPenalty` | −0.50 | `move` into cell with `smoke ≥ 0.5` OR adjacent to `fire ≥ 0.3` | |
| | `HealthDrainPenalty` | −0.02 × dmg | Proportional to HP lost this step | |
| | `StrategicDoorBonus` | +0.50 | Closed a door with a cardinal neighbor `fire ≥ 0.3`; once per door per episode | |
| | `ExplorationBonus` | +0.02 | `move` to a cell not visited this episode | |
|
|
| ### Episode-end rubrics (fire only when `done=True`) |
|
|
| | Class | Value | Condition | |
| |---|---|---| |
| | `SelfSurviveBonus` | +5.0 | Agent evacuated alive | |
| | `HealthSurvivalBonus` | +1.5 × (hp/100) | Agent evacuated (range 0 → +1.5) | |
| | `SelfDeathPenalty` | −10.0 | Agent died (HP ≤ 0) | |
| | `TimeoutPenalty` | −5.0 to −8.0 | Alive but out of steps; scaled by `−5 − 3×(hp/100)` when exits were reachable | |
| | `NearMissBonus` | max(0, 3.0 − 0.5 × min_BFS_dist) | On death only; `min_BFS_dist` = closest BFS distance to any exit reached this episode | |
| | `TimeBonus` | +0.05 × remaining_steps | Agent evacuated | |
| |
| **BFS note:** `ProgressReward`, `ProgressRegressionPenalty`, `SafeProgressBonus`, and `NearMissBonus` all use true BFS traversal distance (walls and obstacles block; closed doors are treated as passable so the reward models optimal reachability assuming doors can be opened). The PPO trainer’s exit “pull” (below) uses **Manhattan** distance to listed exit cells only for an extra shaping signal — it is not part of the environment rubric. |
| |
| ### PPO training script only (`training/ppo/train_torch_ppo.py`) |
| |
| The server’s step reward above is **further adjusted** inside the training loop (not returned to HTTP clients): **−0.05** on `wait`; **−0.15** after a `move` if any cardinal neighbor has fire **> 0.15**; **−0.20** if the new position was already in the last **12** positions this episode; **+ max(0, 0.25 − 0.04 × d)** on `move` when not yet evacuated, where **d** is **Manhattan** distance to the nearest cell in `map_state.exit_positions`. |
| |
| --- |
| |
| ## Difficulty presets |
| |
| | Level | Sources | Spread rate | Humidity | Wind | Max steps | Map | |
| |---|---|---|---|---|---|---| |
| | `easy` | 1 | 10–20% | 30–50% | CALM only | 200 | Fixed 16×16 templates | |
| | `medium` | 2–4 | 15–40% | 10–45% | Any | 150 | Fixed 16×16 templates | |
| | `hard` | 3–5 | 30–55% | 5–20% | Never CALM | 100 | Procedural 20×24 | |
| |
| **Health damage rates** (applied after fire sim step): |
| |
| | Condition | HP/step | |
| |---|---| |
| | Light smoke (`0.2–0.5`) | 0.5 | |
| | Moderate smoke (`0.5–0.8`) | 2.0 | |
| | Heavy smoke (`≥ 0.8`) | 5.0 | |
| | On fire (`fire ≥ 0.3`) | 10.0 | |
| |
| Smoke and fire damage stack if both conditions apply. |
| |
| --- |
| |
| ## HTTP API |
| |
| The FastAPI server exposes both the standard OpenEnv routes and additional endpoints: |
| |
| | Method | Path | Body | Returns | |
| |---|---|---|---| |
| | `GET` | `/health` | — | `{"status": "ok"}` | |
| | `POST` | `/reset` | `{"difficulty": "medium", "seed": null}` | `{observation, reward, done, metadata}` | |
| | `POST` | `/step` | `{"action": "move", "direction": "north"}` | `{observation, reward, done, metadata}` | |
| | `GET` | `/state` | — | Full `PyreState` dump | |
| | `GET` | `/scene` | — | Structured scene graph for UI renderers | |
| | `GET` | `/` | — | Frontend `index.html` | |
| |
| `/scene` returns a 5-channel per-cell tensor (`cell_type`, `fire`, `smoke`, `is_agent`, `is_visible`) plus structured `labels` (agent position/health/location, episode params, door registry) — consumed by the React frontend. |
|
|
| --- |
|
|
| ## Training |
|
|
| Three training surfaces share the same PPO algorithm core from `train_torch_ppo.py`: |
|
|
| ### 1. In-process (fastest) |
|
|
| ```bash |
| python training/ppo/train_torch_ppo.py \ |
| --episodes 500 \ |
| --device cuda \ |
| --difficulty-schedule easy,medium,hard \ |
| --patience-threshold 0.65 \ |
| --output artifacts/pyre_ppo.pt |
| ``` |
|
|
| ### 2. HTTP (against live server) |
|
|
| `train_torch_ppo_http.py` is a thin wrapper; it runs `train_torch_ppo.py` with `--server http://localhost:8000` and passes through any other flags. |
|
|
| ```bash |
| # Start server first |
| uv run server |
| |
| # Equivalent: python training/ppo/train_torch_ppo.py --server http://localhost:8000 --episodes 300 |
| python training/ppo/train_torch_ppo_http.py --episodes 300 |
| ``` |
|
|
| ### 3. Colab notebook (against HF Space) |
|
|
| Open [`training/ppo/pyre_ppo_training.ipynb`](training/ppo/pyre_ppo_training.ipynb) or the [hosted Colab](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing). The notebook points `SERVER_URL` at `https://krooz-pyre-env.hf.space` and trains entirely over HTTP. |
|
|
| ### Observation encoding |
|
|
| The `ObservationEncoder` in `train_torch_ppo.py` encodes each `PyreObservation` into a **5,790-dim float32 vector** (`ObservationEncoder.base_dim`): |
|
|
| ``` |
| Grid: 24×24 × 10 channels = 5,760 |
| • 6 one-hot cell type (floor/wall/door_open/door_closed/exit/obstacle) |
| • fire intensity [0,1] |
| • smoke density [0,1] |
| • visibility mask (1=visible) |
| • agent position mask |
| |
| Scalars: 17 global features |
| health, step_progress, fire_spread_rate, humidity, |
| agent_x_norm, agent_y_norm, nearest_exit_distance, |
| reachable_exit_count, visible_cell_count, fire_sources, |
| smoke_severity, alive, evacuated, |
| exit_dx_norm, exit_dy_norm, exit_manhattan_norm ← exit compass (map-agnostic) |
| |
| One-hots: wind (5) + difficulty (4: easy, medium, hard_fixed, hard) + route hint (4: N/S/W/E) = 13 |
| |
| Total: 5,760 + 17 + 13 = 5,790 |
| ``` |
|
|
| With `--history-length 4` (default), four frames are stacked: **input_dim = 23,160**. |
| |
| ### Network architecture |
| |
| ``` |
| Input (23,160) |
| → LayerNorm → FC(512) → LayerNorm → ReLU |
| → FC(256) → LayerNorm → ReLU |
| → FC(128) → ReLU |
| ├── Policy head → FC(37) logits + action mask (−∞ for invalid) |
| └── Value head → FC(1) scalar |
| ``` |
| |
| The policy uses **37** discrete actions (4 move, 1 wait, 16 door-open, 16 door-close); `look` is not in the PPO head because the map encoder already carries visibility. Orthogonal init (√2 gain hidden layers, 0.01 policy head). Total parameters: **~12.1M** (varies slightly with `hidden-sizes`). |
| |
| ### Default PPO / curriculum flags |
| |
| - `--difficulty-schedule` default: `easy,medium,hard_fixed,hard` (full three-stage path including two hard modes). |
| - Patience (defaults): threshold **0.65**, window **15** episodes, optional `--hard-mix-ratio` / `--hard-mix-dist` during the hard stage. |
|
|
| ### Curriculum |
|
|
| The `PatienceCurriculum` gating: |
|
|
| ``` |
| Stay on current stage until: |
| success_rate (last 30 eps) ≥ patience_threshold (default 0.65) |
| for patience_window (default 15) consecutive episodes |
| → then advance to the next stage in --difficulty-schedule |
| |
| Final stage: optional replay of the previous stage (--hard-mix-ratio, default 0.25) |
| or a custom distribution (--hard-mix-dist), to limit forgetting |
| ``` |
|
|
| With the default schedule `easy,medium,hard_fixed,hard`, the “replay” stage during **hard** is typically **`hard_fixed`**, not `medium`. The **`pyre_ppo_hard_v2`** run used an explicit `hard:…,medium:…,easy:…` mix instead; see `training/push_to_hub.py` for the exact flags. |
|
|
| ### Push to HuggingFace Hub |
|
|
| ```bash |
| export HF_TOKEN=hf_... |
| uv run python training/push_to_hub.py \ |
| --repo-id Krooz/pyre-ppo-agent \ |
| --stem pyre_ppo_hard_v2 \ |
| --artifacts-dir artifacts |
| ``` |
|
|
| Uploads `{stem}.pt`, `{stem}.csv`, `{stem}.png`, `{stem}_eval.csv`, and generates a model card README. The script’s embedded summary targets the **`pyre_ppo_hard_v2`** HTTP run; adjust `--stem` if you use a different checkpoint prefix. Trained weights: **[Krooz/pyre-ppo-agent](https://huggingface.co/Krooz/pyre-ppo-agent)**. |
| |
| ### Training results |
| |
|  |
| |
| **Primary run on record (`artifacts/pyre_ppo_hard_v2.*`): 600 HTTP episodes** with a patience-gated **easy → medium → hard** schedule, eval every 25 episodes on **hard** (see `training/push_to_hub.py` for the exact CLI, metrics table, and hub model-card text). Representative headline numbers from that run: **~55%** final training success rate (MA-20, graph title), **~52.7%** overall evacuation over all 600 episodes, and **~10.5%** evacuation on **hard** episodes within the run—showing the agent still struggles on fully procedural hard maps while improving on easy/medium. |
|
|
| **Earlier ablation (200 episodes, easy → medium only):** a previous curve reached **~75%** success on **medium** after 200 episodes (no `hard` in the training mix). That artifact set is no longer in the tree; the figure and CSV above supersede it for the hackathon write-up. |
|
|
| --- |
|
|
| ## Frontend |
|
|
| A cinematic real-time visualization built in React 19 + Vite + TypeScript. |
|
|
| ```bash |
| cd frontend |
| npm install |
| npm run dev # → http://localhost:5173 |
| ``` |
|
|
| The map renderer (`src/components/Map2D.tsx`) uses HTML5 Canvas 2D with: |
| - **5-layer volumetric fire**: dark-red base → orange body → yellow core → white-hot tip → wind-bent plume |
| - **Ember particle system**: 200-max particles, wind-biased velocity, fade-out |
| - **Animated walls**: brick texture with heat-tint shift and crack lines near fire |
| - **Charred obstacles**: dark rubble cells with ember-glow when adjacent to fire |
| - **Fog-of-war**: per-cell alpha overlay; fire beacon glow punches through fog |
| - **Minecraft-style agent**: pixel-art character with health-based color theme (blue→orange→red→purple), gold health arc ring, and movement trail |
|
|
| The agent color changes with HP: `healthy (≥60%) → blue`, `moderate (30–59%) → orange`, `low (1–29%) → red`, `critical (≤0%) → purple`. |
|
|
| The right side panel polls `/scene` every 500ms and shows tactical controls, per-door state (open/closed/failed), agent biometrics, environment stats, event log with reward annotations, and raw network activity. |
|
|
| --- |
|
|
| ## Deployment |
|
|
| ```bash |
| openenv push --repo-id your-org/pyre-env |
| ``` |
|
|
| The `openenv.yaml` manifest declares this as a FastAPI space on port 8000. Docker configuration is in `server/Dockerfile`. |
|
|
| --- |
|
|
| ## Roadmap |
|
|
| The current architecture — cellular automaton physics, composable rubrics, BFS-based visibility, narrative observation layer, dual LLM+RL interface — is designed to generalise. Planned extensions: |
|
|
| ### Other natural disasters |
| The `FireSim` is one implementation of a physics layer. The same environment shell supports alternative calamity models with minimal changes: |
|
|
| | Disaster | Physics swap | New mechanic | |
| |---|---|---| |
| | **Flood** | Water pressure + rising level grid | Agent must find high ground or exits before water fills corridors | |
| | **Earthquake** | Probabilistic wall collapse | Rubble blocks form during episode; structural integrity per cell | |
| | **Chemical spill** | Wind-borne toxin concentration | Invisible hazard; agent must infer spread direction from health decay | |
| | **Wildfire (ground level)** | Existing fire sim, outdoor map | No walls, wind-dominated spread, sparse exits | |
|
|
| Each shares the same reward rubric composability, observation layer, and training stack. |
|
|
| ### NPC characters |
| The floor plan templates already define `spawn_zones` and the state model has placeholders for multi-agent positions. Next steps: |
| - Add panicking civilians who move randomly and block corridors |
| - Rescue mechanic: escort NPCs to exits for bonus reward |
| - Theory-of-mind challenge: agent must model NPC movement to plan around them |
| - Competing agent: second RL agent racing for the same exit (mixed cooperative/competitive) |
|
|
| ### 3D maps and multi-floor buildings |
| - Stack floor levels connected by staircases |
| - Fire spreads both horizontally and vertically through floor openings |
| - 3D BFS cone-of-vision observation (currently 2D flood-fill) |
| - Elevator shafts as high-risk shortcuts |
| - Procedural multi-floor generator extending the existing Prim-MST approach |
|
|
| ### LLM fine-tuning (GRPO) |
| - `training/` already scaffolds GRPO infrastructure alongside PPO |
| - Fine-tune a language model's policy directly on Pyre episode rollouts |
| - Compare: PPO on structured grid vs GRPO on text narrative — does the LLM develop genuine spatial reasoning or pattern-match the narrative? |
|
|
| ### Harder curriculum stages |
| - `extreme` difficulty: procedural map, 5 fire sources, humidity 0–5%, always hurricane-force wind, 75 max steps |
| - Dynamic difficulty adjustment: real-time difficulty scaling based on agent rolling success rate |
| - Adversarial fire placement: second agent controls fire source positions to maximise agent failure |
|
|
| --- |
|
|
| ## Hackathon alignment |
|
|
| - **Theme #2 — Long-Horizon Planning**: 50–200 step episodes; agent must build a mental map across many partial observations with no global state |
| - **Theme #3.1 — World Modeling**: no global map; agent infers fire spread direction, corridor topology, and exit reachability from local first-person text observations alone |
|
|