---
title: Pyre Crisis Navigation Environment
emoji: 🔥
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
---
# Pyre — Crisis Navigation Environment for LLM Agents
> *When buildings burn, the difference between a safe evacuation and a tragedy is the quality of decisions made in the first 60 seconds. Can we train an LLM to make them?*
**Pyre** places an LLM agent *inside* a burning building. The agent must navigate to safety under partial observability — no global map, a real health system, hard time pressure, and a fire that actively spreads, blocks exits, and permanently alters the floor plan.
**Links:**
🔥 [Live Space](https://krooz-pyre-env.hf.space)  | 
🤖 [Trained Model](https://huggingface.co/Krooz/pyre-ppo-agent)  | 
📓 [Colab Training](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing)  | 
📝 [Blog](BLOG.md)
---
## Why Pyre vs. existing environments
| Feature | `grid_world` | `maze_env` | `wildfire_env` | **Pyre** |
|---|---|---|---|---|
| Observability | Full | Full | Partial | **Partial, first-person, text** |
| Map dynamics | Static | Static | Dynamic (fire) | **Dynamic (fire + doors + burnout)** |
| Action richness | 4 moves | 4 moves | Suppression | **Movement + door control + look** |
| Agent role | Mover | Mover | Suppressor | **Survivor** |
| Reward complexity | Reach goal | Reach goal | Suppress fire | **14-component composite rubric** |
*`wildfire_env` trains an agent to fight fires from above; Pyre trains an agent to survive from inside.*
---
## Quick start
```bash
uv sync
uv run server   # → http://localhost:8000

# Health check
curl http://localhost:8000/health

# Start episode
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "medium"}'

# Take a step
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": "move", "direction": "north"}'

# Random baseline (smoke test)
python examples/random_agent.py --episodes 5 --verbose
```
### Python client
```python
from pyre_env import PyreEnv, PyreAction
with PyreEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.narrative)
    result = env.step(PyreAction(action="move", direction="north"))
    print(f"Reward: {result.reward:.3f} | HP: {result.observation.agent_health}")
```
### Environment variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8000` | HTTP server port |
| `PYRE_MAX_STEPS` | `150` | Default max steps per episode (overridden by difficulty preset) |
| `PYRE_SEED` | `42` | Base RNG seed; each episode increments by 37 |
| `HF_TOKEN` | — | Required only for `training/push_to_hub.py` |
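
As a rough illustration of how these variables could be consumed, the sketch below reads them with the documented defaults. `load_settings` and `episode_seed` are hypothetical helpers, not part of the package, and the seed formula assumes that episode 0 uses the base seed unchanged.

```python
import os


def load_settings(env=None):
    """Read Pyre settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "port": int(env.get("PORT", "8000")),
        "max_steps": int(env.get("PYRE_MAX_STEPS", "150")),
        "seed": int(env.get("PYRE_SEED", "42")),
    }


def episode_seed(base_seed, episode_index):
    # Assumption: episode 0 uses the base seed; each later episode adds 37.
    return base_seed + 37 * episode_index
```

With the defaults, `episode_seed(42, 2)` gives 116 under that assumption.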
---
## Architecture
```
reset() / step()
PyreEnvironment            server/pyre_env_environment.py
├── floor_plan.py          Building template or procedural generation
├── fire_sim.py            Cellular automaton: spread → intensity → smoke
├── narrative.py           BFS visibility → first-person text + structured fields
└── rubrics.py             14 composable reward components
PyreObservation            models.py
├── narrative              str: primary LLM input
├── map_state              PyreMapState: full grid snapshot for RL encoders
├── reward                 float
├── done                   bool
└── metadata               dict: fire params, distances, difficulty
```
### Data flow per step
1. `_execute_action()` — move / door / look / wait, returns feedback string
2. Check evacuation — agent on EXIT cell with `fire < 0.5` → success
3. `FireSim.step()` — advance fire, smoke, burn timers; may convert cells to OBSTACLE
4. Apply health damage from smoke (0.5–5 HP/step) and fire (10 HP/step)
5. `_compute_reward()` — call all 14 rubrics with shared kwargs
6. `build_narrative_observation()` — BFS visibility, compose text, collect action hints
7. `_build_map_state()` — assemble full grid snapshot for UI / RL encoder
8. Return `PyreObservation`
---
## Project structure
```
pyre_env/
├── models.py                               PyreAction, PyreObservation, PyreMapState, PyreState
├── client.py                               PyreEnv (EnvClient subclass, narrative-focused)
├── openenv.yaml                            OpenEnv manifest (space, fastapi, port 8000)
├── pyproject.toml
├── server/
│   ├── app.py                              FastAPI bootstrap; stateful /reset, /step, /state, /scene
│   ├── pyre_env_environment.py             PyreEnvironment state machine + difficulty presets
│   ├── floor_plan.py                       3 hand-authored templates + procedural generator
│   ├── fire_sim.py                         Cellular automaton fire/smoke simulation
│   ├── narrative.py                        BFS visibility + first-person text renderer
│   └── rubrics.py                          14 composable reward rubric classes
├── frontend/
│   ├── src/
│   │   ├── App.tsx                         Dashboard shell: topbar, canvas zone, side panel
│   │   ├── components/Map2D.tsx            Canvas2D renderer: fire, smoke, fog-of-war, agent
│   │   ├── components/HUD.tsx              HP bar, wind compass, step counter overlay
│   │   ├── components/ControlPanel.tsx     Move/door controls, difficulty, auto-wait
│   │   └── components/StatusCard.tsx       Agent biometrics, environment stats
│   └── README.md                           Frontend setup and demo script
├── training/
│   ├── ppo/
│   │   ├── train_torch_ppo.py              PPO (in-process or `--server` for HTTP EnvClient)
│   │   ├── train_torch_ppo_http.py         Thin wrapper: forwards argv to `train_torch_ppo.py --server ...`
│   │   └── pyre_ppo_training.ipynb         Colab notebook (self-contained, talks to HF Space)
│   └── push_to_hub.py                      Upload checkpoint + metrics to HuggingFace Hub
├── examples/
│   └── random_agent.py                     Baseline: 70% hint-biased, 30% random
└── artifacts/                              Training outputs: .pt, .csv, .png
```
---
## Simulation layer
### Fire simulation (`server/fire_sim.py`)
A stochastic cellular automaton over a flat row-major grid. Each call to `FireSim.step()` runs three phases:
**Phase 1 — Ignition.** Any cell with `fire ≥ FIRE_BURNING (0.3)` tries to ignite each cardinal neighbor:
```
p_ignite = p_spread × (1 − humidity) × wind_multiplier × fuel_map[neighbor]
```
- **Wind multiplier**: dot product of spread direction with wind vector → downwind 2×, upwind 0.5×, crosswind 1×
- **Closed doors**: `DOOR_CLOSED_FIRE_FACTOR = 0.15` (fire crosses at 15% normal rate)
- **Fuel map**: per-cell float from `floor_plan.py`; office rooms 1.5×, exits 0.6×
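
The ignition formula and its modifiers can be sketched as below. This is an illustration rather than the code in `fire_sim.py`: the function names are hypothetical, directions are assumed to be unit cardinal vectors, calm wind is modelled as `(0, 0)`, and the final clamp to 1.0 is an assumption.

```python
DOOR_CLOSED_FIRE_FACTOR = 0.15


def wind_multiplier(spread_dir, wind_dir):
    """Map the dot product of spread and wind vectors to 2x / 0.5x / 1x."""
    dot = spread_dir[0] * wind_dir[0] + spread_dir[1] * wind_dir[1]
    if dot > 0:
        return 2.0  # spreading downwind
    if dot < 0:
        return 0.5  # spreading upwind
    return 1.0      # crosswind, or calm wind modelled as (0, 0)


def ignition_probability(p_spread, humidity, spread_dir, wind_dir, fuel,
                         through_closed_door=False):
    """p_ignite = p_spread * (1 - humidity) * wind_multiplier * fuel."""
    p = p_spread * (1.0 - humidity) * wind_multiplier(spread_dir, wind_dir) * fuel
    if through_closed_door:
        p *= DOOR_CLOSED_FIRE_FACTOR  # fire crosses at 15% of the normal rate
    return min(1.0, p)  # clamp is an assumption, not documented behaviour
```

For example, with `p_spread = 0.3`, `humidity = 0.2`, a downwind neighbor, and office fuel 1.5, the probability is 0.3 × 0.8 × 2 × 1.5 = 0.72, dropping to 0.108 through a closed door.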
**Phase 2 — Intensity.** Existing fire gains `FIRE_INTENSITY_GAIN (0.15) × fuel_map[i]` per step. When `burn_timer ≥ BURNOUT_TICKS (5)` and intensity reaches 1.0, the cell becomes `OBSTACLE` — permanently impassable rubble.
**Phase 3 — Smoke.** Smoke is sourced at +0.3/step for cells with `fire ≥ 0.3`, diffuses between neighbors at `SMOKE_SPREAD_RATE (0.20)`, passes through closed doors at 40% rate, and decays per cell according to `ventilation_map`.
**Key constants:**
| Constant | Value | Role |
|---|---|---|
| `FIRE_IGNITION` | 0.1 | Starting intensity for new ignitions |
| `FIRE_BURNING` | 0.3 | Threshold for spreading and causing damage |
| `FIRE_INTENSITY_GAIN` | 0.15 | Intensity added per step to burning cell |
| `BURNOUT_TICKS` | 5 | Steps at full intensity before cell → OBSTACLE |
| `DOOR_CLOSED_FIRE_FACTOR` | 0.15 | Fire spread multiplier through closed doors |
| `SMOKE_SPREAD_RATE` | 0.20 | Smoke diffusion rate between neighbors |
| `SMOKE_DOOR_FACTOR` | 0.40 | Smoke rate through closed doors |
| `EXIT_BLOCKED_FIRE_THRESHOLD` | 0.5 | Fire intensity at which an exit is considered blocked |
### Building templates (`server/floor_plan.py`)
Three hand-authored 16×16 templates for easy and medium difficulty:
| Template | Layout | Exits | Doors | Notes |
|---|---|---|---|---|
| `small_office` | Two corridor bands + office rooms N/S | 2 (W, E walls) | 8 (room↔corridor) | Agent spawns in corridor |
| `open_plan` | Open hall with four 2×2 pillar obstacles | 2 (diagonal corners) | 0 | High ventilation throughout |
| `t_corridor` | T-shaped: vertical stem + horizontal bar | 3 (top, left, right) | 4 (rooms off stem) | Multiple route decisions |
Each template carries a `zone_map` (cell → zone label), derived `fuel_map`, and `ventilation_map`:
| Zone | Fuel multiplier | Smoke decay/step | Notes |
|---|---|---|---|
| `north/south_offices` | 1.5× | 0.010 | High fuel, poor ventilation |
| `west/east_rooms` | 1.5× | 0.010 | Same as offices |
| `main_corridor` | 1.0× | 0.028 | Baseline |
| `northwest/northeast/etc. hall` | 0.9× | 0.050 | Open plan — best ventilation |
| `exit` | 0.6× | 0.040 | Concrete, vented |
**Hard mode — procedural generation.** Episodes run on a freshly generated 20×24 floor plan every time:
1. **Room placement**: random non-overlapping rectangles (3–5 × 3–4 cells, 6–10 rooms)
2. **MST corridors**: Prim-style minimum spanning tree connecting room centers via L-shaped tunnels
3. **Exit placement**: deterministic tunnels from leftmost/rightmost floor cells to outer walls
4. **Connectivity guard**: BFS from agent spawn verifies ≥1 exit is reachable; up to 3 attempts; falls back to `small_office`
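
The connectivity guard in step 4 can be approximated as follows. This is a sketch under simplifying assumptions: the grid is a list of strings where `#` is a wall and `E` an exit, and `generate`, `spawn_of`, and `fallback` are hypothetical stand-ins for the real generator, spawn lookup, and `small_office` template.

```python
from collections import deque


def exit_reachable(grid, start):
    """BFS from `start` (row, col); True if any 'E' cell can be reached."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "E":
            return True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and grid[nr][nc] != "#"):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def generate_with_guard(generate, spawn_of, fallback, attempts=3):
    """Retry generation up to `attempts` times, then return the fallback."""
    for _ in range(attempts):
        grid = generate()
        if exit_reachable(grid, spawn_of(grid)):
            return grid
    return fallback
```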
### Visibility (`server/narrative.py`)
Visibility is a BFS flood-fill from the agent's position; walls block expansion. The radius shrinks with the smoke level at the agent's cell:
| Agent smoke level | Visibility radius |
|---|---|
| None / light (`< 0.5`) | 5 cells |
| Moderate (`0.5–0.8`) | 3 cells |
| Heavy (`≥ 0.8`) | 2 cells |
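
A depth-limited BFS using the radii in the table might look like the sketch below. It assumes a list-of-strings grid with `#` for walls and treats wall cells themselves as not visible, which may differ from what `narrative.py` does.

```python
from collections import deque


def visibility_radius(smoke):
    """Radius in cells for the smoke level at the agent's position."""
    if smoke >= 0.8:
        return 2
    if smoke >= 0.5:
        return 3
    return 5


def visible_cells(grid, start, smoke):
    """Flood-fill from `start` (row, col), stopping at the smoke-limited radius."""
    radius = visibility_radius(smoke)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if dist[(r, c)] == radius:
            continue  # do not expand past the radius
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and (nr, nc) not in dist and grid[nr][nc] != "#"):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return set(dist)
```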
---
## What the agent sees
Every step, `narrative.py` assembles a first-person text observation from raw grid state:
```
You are in the **main_corridor**. The air is **moderate**.
Health: ████████░░ (85/100) | Wind: **EAST**
Flames are visible to the **west**.
Exits visible: exit_0_7 at 8m west.
Doors: door_1 (closed) at 2m east.
You hear: Fire alarm sounding; Smoke detector beeping.
Last action: You move south. The smoke is thick here.
Available actions: move(direction='north') move(direction='south')
door(target_id='door_1', door_state='open') look(direction='east') wait()
```
The same state is also exposed as structured fields in `PyreObservation` (smoke level, fire direction, visible objects, blocked exits, action hints) and as a full grid snapshot in `PyreObservation.map_state` for programmatic / RL use.
---
## Action space
| Action | Parameters | Effect |
|---|---|---|
| `move` | `direction: north\|south\|east\|west` | Move one cell; blocked by walls, obstacles, closed doors |
| `door` | `target_id: str`, `door_state: open\|close` | Open or close a door within 2 cells Manhattan distance |
| `look` | `direction: north\|south\|east\|west` | Ray-scan up to 5 cells; returns per-cell smoke/fire/zone/door/exit detail. Time still advances. |
| `wait` | — | Skip turn |
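
The table maps naturally onto JSON bodies for `/step`. The helper below is hypothetical: the `move` shape matches the quick-start example, while the `door`, `look`, and `wait` shapes are inferred from the parameter names in the table and are assumptions.

```python
DIRECTIONS = {"north", "south", "east", "west"}
DOOR_STATES = {"open", "close"}


def build_step_payload(action, direction=None, target_id=None, door_state=None):
    """Validate an action against the table above and return a JSON-ready dict."""
    if action in ("move", "look"):
        if direction not in DIRECTIONS:
            raise ValueError(f"invalid direction: {direction!r}")
        return {"action": action, "direction": direction}
    if action == "door":
        if not target_id or door_state not in DOOR_STATES:
            raise ValueError("door needs target_id and door_state in open|close")
        return {"action": "door", "target_id": target_id, "door_state": door_state}
    if action == "wait":
        return {"action": "wait"}
    raise ValueError(f"unknown action: {action!r}")
```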
---
## Reward function — all 14 components
### Per-step rubrics
| Class | Value | Condition |
|---|---|---|
| `TimeStepPenalty` | −0.01 | Every step |
| `ProgressReward` | +0.25 | `move` reduced BFS distance to nearest unblocked exit |
| `ProgressRegressionPenalty` | −0.15 | `move` increased BFS distance to nearest exit |
| `SafeProgressBonus` | +0.05 | Progress AND new cell has `smoke < 0.5` |
| `DangerPenalty` | −0.50 | `move` into cell with `smoke ≥ 0.5` OR adjacent to `fire ≥ 0.3` |
| `HealthDrainPenalty` | −0.02 × dmg | Proportional to HP lost this step |
| `StrategicDoorBonus` | +0.50 | Closed a door with a cardinal neighbor `fire ≥ 0.3`; once per door per episode |
| `ExplorationBonus` | +0.02 | `move` to a cell not visited this episode |
### Episode-end rubrics (fire only when `done=True`)
| Class | Value | Condition |
|---|---|---|
| `SelfSurviveBonus` | +5.0 | Agent evacuated alive |
| `HealthSurvivalBonus` | +1.5 × (hp/100) | Agent evacuated (range 0 → +1.5) |
| `SelfDeathPenalty` | −10.0 | Agent died (HP ≤ 0) |
| `TimeoutPenalty` | −5.0 to −8.0 | Alive but out of steps; scaled by `−5 − 3×(hp/100)` when exits were reachable |
| `NearMissBonus` | max(0, 3.0 − 0.5 × min_BFS_dist) | On death only; `min_BFS_dist` = closest BFS distance to any exit reached this episode |
| `TimeBonus` | +0.05 × remaining_steps | Agent evacuated |
**BFS note:** `ProgressReward`, `ProgressRegressionPenalty`, `SafeProgressBonus`, and `NearMissBonus` all use true BFS traversal distance (walls and obstacles block; closed doors are treated as passable so the reward models optimal reachability assuming doors can be opened). The PPO trainer’s exit “pull” (below) uses **Manhattan** distance to listed exit cells only for an extra shaping signal — it is not part of the environment rubric.
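
To make the episode-end arithmetic concrete, the sketch below sums the terminal rubrics for each outcome. It is not the code in `rubrics.py`: the function name and `outcome` strings are hypothetical, and the timeout branch covers only the documented case where exits were reachable.

```python
def episode_end_reward(outcome, hp, remaining_steps=0, min_bfs_dist=None):
    """Sum of the episode-end rubric values for one terminal outcome."""
    if outcome == "evacuated":
        self_survive = 5.0
        health_bonus = 1.5 * (hp / 100.0)
        time_bonus = 0.05 * remaining_steps
        return self_survive + health_bonus + time_bonus
    if outcome == "died":
        near_miss = 0.0
        if min_bfs_dist is not None:
            near_miss = max(0.0, 3.0 - 0.5 * min_bfs_dist)
        return -10.0 + near_miss
    if outcome == "timeout":
        # Documented scaling when exits were reachable: -5 - 3 * (hp / 100).
        return -5.0 - 3.0 * (hp / 100.0)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

Evacuating at 80 HP with 40 steps left yields 5.0 + 1.2 + 2.0 = 8.2, while dying after getting within 2 cells of an exit yields −10.0 + 2.0 = −8.0.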
### PPO training script only (`training/ppo/train_torch_ppo.py`)
The training loop **further adjusts** the server’s step reward described above. These adjustments are applied inside the trainer and are not returned to HTTP clients:
- **−0.05** on `wait`
- **−0.15** after a `move` if any cardinal neighbor has fire **> 0.15**
- **−0.20** if the new position was already among the last **12** positions this episode
- **+ max(0, 0.25 − 0.04 × d)** on `move` when not yet evacuated, where **d** is the **Manhattan** distance to the nearest cell in `map_state.exit_positions`
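
A compact sketch of these trainer-side adjustments follows. The function and argument names are hypothetical, and applying the revisit penalty only on `move` is an assumption; the caller is expected to pass the last 12 positions (for example from a `deque(maxlen=12)`).

```python
def shaping_adjustment(action, new_pos, recent_positions, neighbor_fire,
                       exit_positions, evacuated=False):
    """Extra reward added on top of the server's step reward."""
    adj = 0.0
    if action == "wait":
        adj -= 0.05
    if action == "move":
        if any(f > 0.15 for f in neighbor_fire):
            adj -= 0.15  # a cardinal neighbor is burning
        if new_pos in recent_positions:
            adj -= 0.20  # revisiting one of the last 12 positions
        if not evacuated and exit_positions:
            d = min(abs(new_pos[0] - ex) + abs(new_pos[1] - ey)
                    for ex, ey in exit_positions)
            adj += max(0.0, 0.25 - 0.04 * d)  # Manhattan exit pull
    return adj
```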
---
## Difficulty presets
| Level | Sources | Spread rate | Humidity | Wind | Max steps | Map |
|---|---|---|---|---|---|---|
| `easy` | 1 | 10–20% | 30–50% | CALM only | 200 | Fixed 16×16 templates |
| `medium` | 2–4 | 15–40% | 10–45% | Any | 150 | Fixed 16×16 templates |
| `hard` | 3–5 | 30–55% | 5–20% | Never CALM | 100 | Procedural 20×24 |
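
The presets can be read as ranges from which per-episode parameters are drawn. The sketch below assumes uniform sampling, which the table does not specify; `PRESETS` and `sample_episode_params` are hypothetical names, and wind selection is omitted.

```python
import random

# Ranges copied from the table above; percentages expressed as fractions.
PRESETS = {
    "easy":   {"sources": (1, 1), "spread": (0.10, 0.20), "humidity": (0.30, 0.50), "max_steps": 200},
    "medium": {"sources": (2, 4), "spread": (0.15, 0.40), "humidity": (0.10, 0.45), "max_steps": 150},
    "hard":   {"sources": (3, 5), "spread": (0.30, 0.55), "humidity": (0.05, 0.20), "max_steps": 100},
}


def sample_episode_params(difficulty, seed):
    """Draw one episode's fire parameters (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    preset = PRESETS[difficulty]
    return {
        "fire_sources": rng.randint(*preset["sources"]),
        "p_spread": rng.uniform(*preset["spread"]),
        "humidity": rng.uniform(*preset["humidity"]),
        "max_steps": preset["max_steps"],
    }
```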
**Health damage rates** (applied after fire sim step):
| Condition | HP/step |
|---|---|
| Light smoke (`0.2–0.5`) | 0.5 |
| Moderate smoke (`0.5–0.8`) | 2.0 |
| Heavy smoke (`≥ 0.8`) | 5.0 |
| On fire (`fire ≥ 0.3`) | 10.0 |
Smoke and fire damage stack if both conditions apply.
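
The damage table reduces to a small lookup. The function name is hypothetical, and treating each band's lower bound as inclusive is an assumption for the boundaries the table leaves open.

```python
def health_damage(smoke, fire):
    """HP lost in one step from the smoke and fire levels at the agent's cell."""
    damage = 0.0
    if smoke >= 0.8:
        damage += 5.0   # heavy smoke
    elif smoke >= 0.5:
        damage += 2.0   # moderate smoke
    elif smoke >= 0.2:
        damage += 0.5   # light smoke
    if fire >= 0.3:
        damage += 10.0  # standing in a burning cell; stacks with smoke
    return damage
```

Standing in moderate smoke on a burning cell costs 2.0 + 10.0 = 12.0 HP per step.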
---
## HTTP API
The FastAPI server exposes both the standard OpenEnv routes and additional endpoints:
| Method | Path | Body | Returns |
|---|---|---|---|
| `GET` | `/health` | — | `{"status": "ok"}` |
| `POST` | `/reset` | `{"difficulty": "medium", "seed": null}` | `{observation, reward, done, metadata}` |
| `POST` | `/step` | `{"action": "move", "direction": "north"}` | `{observation, reward, done, metadata}` |
| `GET` | `/state` | — | Full `PyreState` dump |
| `GET` | `/scene` | — | Structured scene graph for UI renderers |
| `GET` | `/` | — | Frontend `index.html` |
`/scene` returns a 5-channel per-cell tensor (`cell_type`, `fire`, `smoke`, `is_agent`, `is_visible`) plus structured `labels` (agent position/health/location, episode params, door registry) — consumed by the React frontend.
---
## Training
Three training surfaces share the same PPO algorithm core from `train_torch_ppo.py`:
### 1. In-process (fastest)
```bash
python training/ppo/train_torch_ppo.py \
  --episodes 500 \
  --device cuda \
  --difficulty-schedule easy,medium,hard \
  --patience-threshold 0.65 \
  --output artifacts/pyre_ppo.pt
```
### 2. HTTP (against live server)
`train_torch_ppo_http.py` is a thin wrapper; it runs `train_torch_ppo.py` with `--server http://localhost:8000` and passes through any other flags.
```bash
# Start server first
uv run server
# Equivalent: python training/ppo/train_torch_ppo.py --server http://localhost:8000 --episodes 300
python training/ppo/train_torch_ppo_http.py --episodes 300
```
### 3. Colab notebook (against HF Space)
Open [`training/ppo/pyre_ppo_training.ipynb`](training/ppo/pyre_ppo_training.ipynb) or the [hosted Colab](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing). The notebook points `SERVER_URL` at `https://krooz-pyre-env.hf.space` and trains entirely over HTTP.
### Observation encoding
The `ObservationEncoder` in `train_torch_ppo.py` encodes each `PyreObservation` into a **5,790-dim float32 vector** (`ObservationEncoder.base_dim`):
```
Grid: 24×24 × 10 channels = 5,760
• 6 one-hot cell type (floor/wall/door_open/door_closed/exit/obstacle)
• fire intensity [0,1]
• smoke density [0,1]
• visibility mask (1=visible)
• agent position mask
Scalars: 17 global features
health, step_progress, fire_spread_rate, humidity,
agent_x_norm, agent_y_norm, nearest_exit_distance,
reachable_exit_count, visible_cell_count, fire_sources,
smoke_severity, alive, evacuated,
exit_dx_norm, exit_dy_norm, exit_manhattan_norm ← exit compass (map-agnostic)
One-hots: wind (5) + difficulty (4: easy, medium, hard_fixed, hard) + route hint (4: N/S/W/E) = 13
Total: 5,760 + 17 + 13 = 5,790
```
With `--history-length 4` (default), four frames are stacked: **input_dim = 23,160**.
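
The dimension arithmetic above can be checked in a few lines. The constant names are illustrative only and are not taken from `train_torch_ppo.py`.

```python
GRID_SIDE = 24            # grid is encoded as 24x24
CHANNELS = 6 + 4          # 6 one-hot cell types + fire, smoke, visibility, agent mask
SCALARS = 17              # global features
ONE_HOTS = 5 + 4 + 4      # wind + difficulty + route hint

base_dim = GRID_SIDE * GRID_SIDE * CHANNELS + SCALARS + ONE_HOTS


def input_dim(history_length=4):
    """Network input size when `history_length` frames are stacked."""
    return base_dim * history_length
```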
### Network architecture
```
Input (23,160)
→ LayerNorm → FC(512) → LayerNorm → ReLU
→ FC(256) → LayerNorm → ReLU
→ FC(128) → ReLU
├── Policy head → FC(37) logits + action mask (−∞ for invalid)
└── Value head → FC(1) scalar
```
The policy uses **37** discrete actions (4 move, 1 wait, 16 door-open, 16 door-close); `look` is not in the PPO head because the map encoder already carries visibility. Orthogonal init (√2 gain hidden layers, 0.01 policy head). Total parameters: **~12.1M** (varies slightly with `hidden-sizes`).
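
The parameter figure can be sanity-checked without loading the model. The count below assumes every FC layer has a bias and every LayerNorm has a learnable scale and shift; the real module may differ slightly.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias


def layernorm_params(n):
    return 2 * n  # scale + shift


def count_parameters(input_dim=23_160, hidden=(512, 256, 128), n_actions=37):
    """Count parameters for the layer stack shown in the diagram above."""
    total = layernorm_params(input_dim)  # input LayerNorm
    prev = input_dim
    for i, width in enumerate(hidden):
        total += linear_params(prev, width)
        if i < len(hidden) - 1:
            total += layernorm_params(width)  # no LayerNorm after the last hidden layer
        prev = width
    total += linear_params(prev, n_actions)  # policy head
    total += linear_params(prev, 1)          # value head
    return total
```

Under these assumptions the count is 12,075,414, with the first FC layer (23,160 × 512) accounting for almost all of it.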
### Default PPO / curriculum flags
- `--difficulty-schedule` default: `easy,medium,hard_fixed,hard` (four stages: easy, medium, and two hard modes).
- Patience (defaults): threshold **0.65**, window **15** episodes, optional `--hard-mix-ratio` / `--hard-mix-dist` during the hard stage.
### Curriculum
`PatienceCurriculum` gates stage advancement as follows:
```
Stay on current stage until:
success_rate (last 30 eps) ≥ patience_threshold (default 0.65)
for patience_window (default 15) consecutive episodes
→ then advance to the next stage in --difficulty-schedule
Final stage: optional replay of the previous stage (--hard-mix-ratio, default 0.25)
or a custom distribution (--hard-mix-dist), to limit forgetting
```
With the default schedule `easy,medium,hard_fixed,hard`, the “replay” stage during **hard** is typically **`hard_fixed`**, not `medium`. The **`pyre_ppo_hard_v2`** run used an explicit `hard:…,medium:…,easy:…` mix instead; see `training/push_to_hub.py` for the exact flags.
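
The gating rule in this section can be sketched as a small class. `PatienceGate` is a hypothetical name, not the trainer's `PatienceCurriculum`; computing the rate over fewer than 30 episodes early on and clearing history on advancement are both assumptions, and final-stage replay mixing is left out.

```python
from collections import deque


class PatienceGate:
    """Advance a stage once the rolling success rate holds above a threshold."""

    def __init__(self, stages, threshold=0.65, patience_window=15, rate_window=30):
        self.stages = list(stages)
        self.index = 0
        self.threshold = threshold
        self.patience_window = patience_window
        self.results = deque(maxlen=rate_window)  # last 30 episode outcomes
        self.streak = 0                           # consecutive episodes above threshold

    @property
    def stage(self):
        return self.stages[self.index]

    def record(self, success):
        """Record one episode outcome and return the stage for the next episode."""
        self.results.append(1.0 if success else 0.0)
        rate = sum(self.results) / len(self.results)
        self.streak = self.streak + 1 if rate >= self.threshold else 0
        if self.streak >= self.patience_window and self.index < len(self.stages) - 1:
            self.index += 1
            self.results.clear()  # assumption: start the next stage with fresh history
            self.streak = 0
        return self.stage
```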
### Push to HuggingFace Hub
```bash
export HF_TOKEN=hf_...
uv run python training/push_to_hub.py \
  --repo-id Krooz/pyre-ppo-agent \
  --stem pyre_ppo_hard_v2 \
  --artifacts-dir artifacts
```
Uploads `{stem}.pt`, `{stem}.csv`, `{stem}.png`, `{stem}_eval.csv`, and generates a model card README. The script’s embedded summary targets the **`pyre_ppo_hard_v2`** HTTP run; adjust `--stem` if you use a different checkpoint prefix. Trained weights: **[Krooz/pyre-ppo-agent](https://huggingface.co/Krooz/pyre-ppo-agent)**.
### Training results
![Pyre PPO — HTTP run `pyre_ppo_hard_v2`, 600 episodes, easy → medium → hard](artifacts/pyre_ppo_hard_v2.png)
**Primary run on record (`artifacts/pyre_ppo_hard_v2.*`): 600 HTTP episodes** with a patience-gated **easy → medium → hard** schedule and an eval every 25 episodes on **hard** (see `training/push_to_hub.py` for the exact CLI, metrics table, and hub model-card text). Representative headline numbers from that run: **~55%** final training success rate (MA-20, graph title), **~52.7%** overall evacuation over all 600 episodes, and **~10.5%** evacuation on **hard** episodes within the run. The agent still struggles on fully procedural hard maps while improving on easy/medium.
**Earlier ablation (200 episodes, easy → medium only):** a previous curve reached **~75%** success on **medium** after 200 episodes (no `hard` in the training mix). That artifact set is no longer in the tree; the figure and CSV above supersede it for the hackathon write-up.
---
## Frontend
A cinematic real-time visualization built in React 19 + Vite + TypeScript.
```bash
cd frontend
npm install
npm run dev # → http://localhost:5173
```
The map renderer (`src/components/Map2D.tsx`) uses HTML5 Canvas 2D with:
- **5-layer volumetric fire**: dark-red base → orange body → yellow core → white-hot tip → wind-bent plume
- **Ember particle system**: 200-max particles, wind-biased velocity, fade-out
- **Animated walls**: brick texture with heat-tint shift and crack lines near fire
- **Charred obstacles**: dark rubble cells with ember-glow when adjacent to fire
- **Fog-of-war**: per-cell alpha overlay; fire beacon glow punches through fog
- **Minecraft-style agent**: pixel-art character with health-based color theme (blue→orange→red→purple), gold health arc ring, and movement trail
The agent color changes with HP: `healthy (≥60%) → blue`, `moderate (30–59%) → orange`, `low (1–29%) → red`, `critical (≤0%) → purple`.
The right side panel polls `/scene` every 500 ms and shows tactical controls, per-door state (open/closed/failed), agent biometrics, environment stats, an event log with reward annotations, and raw network activity.
---
## Deployment
```bash
openenv push --repo-id your-org/pyre-env
```
The `openenv.yaml` manifest declares this as a FastAPI space on port 8000. Docker configuration is in `server/Dockerfile`.
---
## Roadmap
The current architecture — cellular automaton physics, composable rubrics, BFS-based visibility, narrative observation layer, dual LLM+RL interface — is designed to generalise. Planned extensions:
### Other natural disasters
The `FireSim` is one implementation of a physics layer. The same environment shell supports alternative calamity models with minimal changes:
| Disaster | Physics swap | New mechanic |
|---|---|---|
| **Flood** | Water pressure + rising level grid | Agent must find high ground or exits before water fills corridors |
| **Earthquake** | Probabilistic wall collapse | Rubble blocks form during episode; structural integrity per cell |
| **Chemical spill** | Wind-borne toxin concentration | Invisible hazard; agent must infer spread direction from health decay |
| **Wildfire (ground level)** | Existing fire sim, outdoor map | No walls, wind-dominated spread, sparse exits |
Each shares the same reward rubric composability, observation layer, and training stack.
### NPC characters
The floor plan templates already define `spawn_zones` and the state model has placeholders for multi-agent positions. Next steps:
- Add panicking civilians who move randomly and block corridors
- Rescue mechanic: escort NPCs to exits for bonus reward
- Theory-of-mind challenge: agent must model NPC movement to plan around them
- Competing agent: second RL agent racing for the same exit (mixed cooperative/competitive)
### 3D maps and multi-floor buildings
- Stack floor levels connected by staircases
- Fire spreads both horizontally and vertically through floor openings
- 3D BFS cone-of-vision observation (currently 2D flood-fill)
- Elevator shafts as high-risk shortcuts
- Procedural multi-floor generator extending the existing Prim-MST approach
### LLM fine-tuning (GRPO)
- `training/` already scaffolds GRPO infrastructure alongside PPO
- Fine-tune a language model's policy directly on Pyre episode rollouts
- Compare: PPO on structured grid vs GRPO on text narrative — does the LLM develop genuine spatial reasoning or pattern-match the narrative?
### Harder curriculum stages
- `extreme` difficulty: procedural map, 5 fire sources, humidity 0–5%, always hurricane-force wind, 75 max steps
- Dynamic difficulty adjustment: real-time difficulty scaling based on agent rolling success rate
- Adversarial fire placement: second agent controls fire source positions to maximise agent failure
---
## Hackathon alignment
- **Theme #2 — Long-Horizon Planning**: 50–200 step episodes; agent must build a mental map across many partial observations with no global state
- **Theme #3.1 — World Modeling**: no global map; agent infers fire spread direction, corridor topology, and exit reachability from local first-person text observations alone