metadata

title: Pyre — Crisis Navigation Environment
emoji: 🔥
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv

Pyre — Crisis Navigation Environment for LLM Agents

When buildings burn, the difference between a safe evacuation and a tragedy is the quality of decisions made in the first 60 seconds. Can we train an LLM to make them?

Pyre places an LLM agent inside a burning building. The agent must navigate to safety under partial observability — no global map, a real health system, hard time pressure, and a fire that actively spreads, blocks exits, and permanently alters the floor plan.

Links: 🔥 Live Space | 🤖 Trained Model | 📓 Colab Training | 📝 Blog

Why Pyre vs. existing environments

Feature	`grid_world`	`maze_env`	`wildfire_env`	Pyre
Observability	Full	Full	Partial	Partial, first-person, text
Map dynamics	Static	Static	Dynamic (fire)	Dynamic (fire + doors + burnout)
Action richness	4 moves	4 moves	Suppression	Movement + door control + look
Agent role	Mover	Mover	Suppressor	Survivor
Reward complexity	Reach goal	Reach goal	Suppress fire	14-component composite rubric

wildfire_env trains an agent to fight fires from above; Pyre trains an agent to survive from inside.

Quick start

uv sync
uv run server           # → http://localhost:8000

# Health check
curl http://localhost:8000/health

# Start episode
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "medium"}'

# Take a step
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": "move", "direction": "north"}'

# Random baseline (smoke test)
python examples/random_agent.py --episodes 5 --verbose

Python client

from pyre_env import PyreEnv, PyreAction

with PyreEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(result.observation.narrative)
    result = env.step(PyreAction(action="move", direction="north"))
    print(f"Reward: {result.reward:.3f} | HP: {result.observation.agent_health}")

Environment variables

Variable	Default	Description
`PORT`	`8000`	HTTP server port
`PYRE_MAX_STEPS`	`150`	Default max steps per episode (overridden by difficulty preset)
`PYRE_SEED`	`42`	Base RNG seed; each episode increments by 37
`HF_TOKEN`	—	Required only for `training/push_to_hub.py`

Architecture

reset() / step()
    │
    ▼
PyreEnvironment          server/pyre_env_environment.py
    ├── floor_plan.py    Building template or procedural generation
    ├── fire_sim.py      Cellular automaton: spread → intensity → smoke
    ├── narrative.py     BFS visibility → first-person text + structured fields
    └── rubrics.py       14 composable reward components
    │
    ▼
PyreObservation          models.py
    ├── narrative        str  — primary LLM input
    ├── map_state        PyreMapState — full grid snapshot for RL encoders
    ├── reward           float
    ├── done             bool
    └── metadata         dict — fire params, distances, difficulty

Data flow per step

1. _execute_action(): move / door / look / wait; returns a feedback string
2. Check evacuation: agent on an EXIT cell with fire < 0.5 → success
3. FireSim.step(): advance fire, smoke, and burn timers; may convert cells to OBSTACLE
4. Apply health damage from smoke (0.5–5 HP/step) and fire (10 HP/step)
5. _compute_reward(): call all 14 rubrics with shared kwargs
6. build_narrative_observation(): BFS visibility, compose text, collect action hints
7. _build_map_state(): assemble the full grid snapshot for the UI / RL encoder
8. Return PyreObservation

Project structure

pyre_env/
├── models.py                        PyreAction, PyreObservation, PyreMapState, PyreState
├── client.py                        PyreEnv (EnvClient subclass, narrative-focused)
├── openenv.yaml                     OpenEnv manifest (space, fastapi, port 8000)
├── pyproject.toml
│
├── server/
│   ├── app.py                       FastAPI bootstrap; stateful /reset, /step, /state, /scene
│   ├── pyre_env_environment.py      PyreEnvironment state machine + difficulty presets
│   ├── floor_plan.py                3 hand-authored templates + procedural generator
│   ├── fire_sim.py                  Cellular automaton fire/smoke simulation
│   ├── narrative.py                 BFS visibility + first-person text renderer
│   └── rubrics.py                   14 composable reward rubric classes
│
├── frontend/
│   ├── src/
│   │   ├── App.tsx                  Dashboard shell: topbar, canvas zone, side panel
│   │   ├── components/Map2D.tsx     Canvas2D renderer: fire, smoke, fog-of-war, agent
│   │   ├── components/HUD.tsx       HP bar, wind compass, step counter overlay
│   │   ├── components/ControlPanel.tsx  Move/door controls, difficulty, auto-wait
│   │   └── components/StatusCard.tsx   Agent biometrics, environment stats
│   └── README.md                    Frontend setup and demo script
│
├── training/
│   ├── ppo/
│   │   ├── train_torch_ppo.py       PPO (in-process or `--server` for HTTP EnvClient)
│   │   ├── train_torch_ppo_http.py  Thin wrapper: forwards argv to `train_torch_ppo.py --server ...`
│   │   └── pyre_ppo_training.ipynb  Colab notebook (self-contained, talks to HF Space)
│   └── push_to_hub.py               Upload checkpoint + metrics to HuggingFace Hub
│
├── examples/
│   └── random_agent.py              Baseline: 70% hint-biased, 30% random
│
└── artifacts/                       Training outputs: .pt, .csv, .png

Simulation layer

Fire simulation (`server/fire_sim.py`)

A stochastic cellular automaton over a flat row-major grid. Each call to FireSim.step() runs three phases:

Phase 1 — Ignition. Any cell with fire ≥ FIRE_BURNING (0.3) tries to ignite each cardinal neighbor:

p_ignite = p_spread × (1 − humidity) × wind_multiplier × fuel_map[neighbor]

Wind multiplier: dot product of spread direction with wind vector → downwind 2×, upwind 0.5×, crosswind 1×
Closed doors: DOOR_CLOSED_FIRE_FACTOR = 0.15 (fire crosses at 15% normal rate)
Fuel map: per-cell float from floor_plan.py; office rooms 1.5×, exits 0.6×
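
The ignition rule above can be sketched as a small function. This is an illustrative reading of the formula, not the code in fire_sim.py: the helper names, the convention that the wind direction names where the wind blows toward, the direction vectors, and the final clamp to 1.0 are all assumptions.

```python
# Constant quoted from the "Key constants" table.
DOOR_CLOSED_FIRE_FACTOR = 0.15

# Unit vectors per cardinal direction (x grows east, y grows south: an assumption).
DIRS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}


def wind_multiplier(spread_dir, wind_dir):
    """Return 2.0 downwind, 0.5 upwind, 1.0 crosswind or calm (wind_dir=None)."""
    if wind_dir is None:
        return 1.0
    sx, sy = DIRS[spread_dir]
    wx, wy = DIRS[wind_dir]
    dot = sx * wx + sy * wy  # +1 same direction, -1 opposite, 0 perpendicular
    if dot > 0:
        return 2.0
    if dot < 0:
        return 0.5
    return 1.0


def p_ignite(p_spread, humidity, spread_dir, wind_dir, fuel, through_closed_door=False):
    """Probability that a burning cell ignites one cardinal neighbor this step."""
    p = p_spread * (1.0 - humidity) * wind_multiplier(spread_dir, wind_dir) * fuel
    if through_closed_door:
        p *= DOOR_CLOSED_FIRE_FACTOR  # fire crosses closed doors at 15% of the normal rate
    return min(1.0, p)  # clamp is an assumption; the README does not state it


if __name__ == "__main__":
    # Office cell (fuel 1.5) directly downwind of a burning cell.
    print(p_ignite(0.4, 0.25, "east", "east", 1.5))
    # Same neighbor, but behind a closed door.
    print(p_ignite(0.4, 0.25, "east", "east", 1.5, through_closed_door=True))
```

With a 40% spread rate and 25% humidity, the downwind office cell is far more likely to ignite than the same cell behind a closed door, which is the effect StrategicDoorBonus rewards.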

Phase 2 — Intensity. Existing fire gains FIRE_INTENSITY_GAIN (0.15) × fuel_map[i] per step. When burn_timer ≥ BURNOUT_TICKS (5) and intensity reaches 1.0, the cell becomes OBSTACLE — permanently impassable rubble.
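
To get a feel for how quickly a burning cell turns into rubble, the sketch below iterates the Phase 2 rule for a single cell. It assumes that intensity is clamped at 1.0 and that burn_timer counts steps spent at full intensity (as the constants table describes BURNOUT_TICKS); the function name is hypothetical and the real bookkeeping in fire_sim.py may differ.

```python
# Constants quoted from the "Key constants" table.
FIRE_IGNITION = 0.1
FIRE_INTENSITY_GAIN = 0.15
BURNOUT_TICKS = 5


def steps_until_rubble(fuel):
    """Steps from ignition until the cell becomes OBSTACLE, under the stated assumptions."""
    intensity = FIRE_IGNITION
    burn_timer = 0
    steps = 0
    while burn_timer < BURNOUT_TICKS:
        steps += 1
        intensity = min(1.0, intensity + FIRE_INTENSITY_GAIN * fuel)
        # Small tolerance so float rounding does not delay "full intensity".
        if intensity >= 1.0 - 1e-9:
            burn_timer += 1
    return steps


if __name__ == "__main__":
    for zone, fuel in (("office", 1.5), ("corridor", 1.0), ("exit", 0.6)):
        print(zone, steps_until_rubble(fuel))
```

Under this model, high-fuel office cells collapse a couple of steps sooner than corridor cells, so routes through offices close faster than routes through corridors.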

Phase 3 — Smoke. Smoke is sourced at +0.3/step for cells with fire ≥ 0.3, diffuses between neighbors at SMOKE_SPREAD_RATE (0.20), passes through closed doors at 40% rate, and decays per cell according to ventilation_map.

Key constants:

Constant	Value	Role
`FIRE_IGNITION`	0.1	Starting intensity for new ignitions
`FIRE_BURNING`	0.3	Threshold for spreading and causing damage
`FIRE_INTENSITY_GAIN`	0.15	Intensity added per step to burning cell
`BURNOUT_TICKS`	5	Steps at full intensity before cell → OBSTACLE
`DOOR_CLOSED_FIRE_FACTOR`	0.15	Fire spread multiplier through closed doors
`SMOKE_SPREAD_RATE`	0.20	Smoke diffusion rate between neighbors
`SMOKE_DOOR_FACTOR`	0.40	Smoke rate through closed doors
`EXIT_BLOCKED_FIRE_THRESHOLD`	0.5	Fire intensity at which an exit is considered blocked

Building templates (`server/floor_plan.py`)

Three hand-authored 16×16 templates for easy and medium difficulty:

Template	Layout	Exits	Doors	Notes
`small_office`	Two corridor bands + office rooms N/S	2 (W, E walls)	8 (room↔corridor)	Agent spawns in corridor
`open_plan`	Open hall with 4 × 2×2 pillar obstacles	2 (diagonal corners)	0	High ventilation throughout
`t_corridor`	T-shaped: vertical stem + horizontal bar	3 (top, left, right)	4 (rooms off stem)	Multiple route decisions

Each template carries a zone_map (cell → zone label), derived fuel_map, and ventilation_map:

Zone	Fuel multiplier	Smoke decay/step	Notes
`north/south_offices`	1.5×	0.010	High fuel, poor ventilation
`west/east_rooms`	1.5×	0.010	Same as offices
`main_corridor`	1.0×	0.028	Baseline
`northwest/northeast/etc. hall`	0.9×	0.050	Open plan — best ventilation
`exit`	0.6×	0.040	Concrete, vented

Hard mode — procedural generation. Episodes run on a freshly generated 20×24 floor plan every time:

Room placement: random non-overlapping rectangles (3–5 × 3–4 cells, 6–10 rooms)
MST corridors: Prim-style minimum spanning tree connecting room centers via L-shaped tunnels
Exit placement: deterministic tunnels from leftmost/rightmost floor cells to outer walls
Connectivity guard: BFS from agent spawn verifies ≥1 exit is reachable; up to 3 attempts; falls back to small_office
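
The connectivity guard can be sketched as a BFS reachability check wrapped in a retry loop. The grid encoding ('#' wall, 'E' exit, anything else walkable) and both function names are assumptions made for illustration; floor_plan.py uses its own cell types.

```python
from collections import deque


def exit_reachable(grid, start):
    """BFS from start; True if any 'E' cell can be reached without crossing '#'."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "E":
            return True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and grid[nr][nc] != "#"):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def generate_with_guard(generate, spawn, fallback, max_attempts=3):
    """Try the procedural generator up to max_attempts times, then fall back."""
    for _ in range(max_attempts):
        grid = generate()
        if exit_reachable(grid, spawn):
            return grid
    return fallback()  # e.g. the small_office template


if __name__ == "__main__":
    sealed = ["#####", "#.#E#", "#####"]
    open_plan = ["#####", "#..E#", "#####"]
    result = generate_with_guard(lambda: sealed, (1, 1), lambda: open_plan)
    print(result is open_plan)
```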

Visibility (`server/narrative.py`)

BFS flood-fill from the agent position; walls block expansion. The radius shrinks as smoke at the agent's cell thickens:

Agent smoke level	Visibility radius
None / light (`< 0.5`)	5 cells
Moderate (`0.5–0.8`)	3 cells
Heavy (`≥ 0.8`)	2 cells
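
A minimal sketch of the radius table combined with a depth-limited flood-fill follows. It assumes the radius is measured in BFS steps and that wall cells themselves are not reported as visible; the grid encoding and function names are illustrative, not taken from narrative.py.

```python
from collections import deque


def visibility_radius(smoke):
    """Map the smoke level at the agent's cell to a radius, per the table above."""
    if smoke >= 0.8:
        return 2
    if smoke >= 0.5:
        return 3
    return 5


def visible_cells(grid, agent, smoke):
    """Cells reachable from agent within the radius; '#' cells block expansion."""
    radius = visibility_radius(smoke)
    rows, cols = len(grid), len(grid[0])
    dist = {agent: 0}
    queue = deque([agent])
    while queue:
        r, c = queue.popleft()
        if dist[(r, c)] == radius:
            continue  # do not expand past the visibility radius
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in dist and grid[nr][nc] != "#"):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return set(dist)


if __name__ == "__main__":
    corridor = ["..#......"]  # a wall two cells west of the agent
    for smoke in (0.0, 0.6, 0.9):
        print(smoke, len(visible_cells(corridor, (0, 4), smoke)))
```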

What the agent sees

Every step, narrative.py assembles a first-person text observation from raw grid state:

You are in the **main_corridor**. The air is **moderate**.
Health: ████████░░ (85/100) | Wind: **EAST**
Flames are visible to the **west**.
Exits visible: exit_0_7 at 8m west.
Doors: door_1 (closed) at 2m east.
You hear: Fire alarm sounding; Smoke detector beeping.
Last action: You move south. The smoke is thick here.
Available actions: move(direction='north')  move(direction='south')
                   door(target_id='door_1', door_state='open')  look(direction='east')  wait()

The same state is also exposed as structured fields in PyreObservation (smoke level, fire direction, visible objects, blocked exits, action hints) and as a full grid snapshot in PyreObservation.map_state for programmatic / RL use.

Action space

Action	Parameters	Effect
`move`	`direction: north|south|east|west`	Move one cell; blocked by walls, obstacles, closed doors
`door`	`target_id: str`, `door_state: open|close`	Open or close a door within 2 cells Manhattan distance
`look`	`direction: north|south|east|west`	Ray-scan up to 5 cells; returns per-cell smoke/fire/zone/door/exit detail. Time still advances.
`wait`	—	Skip turn
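
For clients that talk to /step directly, a small validating builder keeps malformed actions from reaching the server. The move body matches the quick-start example; the flat shape for door, look, and wait is an assumption that mirrors it, and step_payload is a hypothetical helper rather than part of pyre_env.

```python
import json

DIRECTIONS = {"north", "south", "east", "west"}
DOOR_STATES = {"open", "close"}


def step_payload(action, direction=None, target_id=None, door_state=None):
    """Build a JSON-serialisable body for POST /step, validating the parameters."""
    if action in ("move", "look"):
        if direction not in DIRECTIONS:
            raise ValueError(f"{action} needs a direction in {sorted(DIRECTIONS)}")
        return {"action": action, "direction": direction}
    if action == "door":
        if not target_id or door_state not in DOOR_STATES:
            raise ValueError("door needs target_id and door_state of 'open' or 'close'")
        return {"action": "door", "target_id": target_id, "door_state": door_state}
    if action == "wait":
        return {"action": "wait"}
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    print(json.dumps(step_payload("move", direction="north")))
    print(json.dumps(step_payload("door", target_id="door_1", door_state="close")))
```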

Reward function — all 14 components

Per-step rubrics

Class	Value	Condition
`TimeStepPenalty`	−0.01	Every step
`ProgressReward`	+0.25	`move` reduced BFS distance to nearest unblocked exit
`ProgressRegressionPenalty`	−0.15	`move` increased BFS distance to nearest exit
`SafeProgressBonus`	+0.05	Progress AND new cell has `smoke < 0.5`
`DangerPenalty`	−0.50	`move` into cell with `smoke ≥ 0.5` OR adjacent to `fire ≥ 0.3`
`HealthDrainPenalty`	−0.02 × dmg	Proportional to HP lost this step
`StrategicDoorBonus`	+0.50	Closed a door with a cardinal neighbor `fire ≥ 0.3`; once per door per episode
`ExplorationBonus`	+0.02	`move` to a cell not visited this episode

Episode-end rubrics (applied only when `done=True`)

Class	Value	Condition
`SelfSurviveBonus`	+5.0	Agent evacuated alive
`HealthSurvivalBonus`	+1.5 × (hp/100)	Agent evacuated (range 0 → +1.5)
`SelfDeathPenalty`	−10.0	Agent died (HP ≤ 0)
`TimeoutPenalty`	−5.0 to −8.0	Alive but out of steps; scaled by `−5 − 3×(hp/100)` when exits were reachable
`NearMissBonus`	max(0, 3.0 − 0.5 × min_BFS_dist)	On death only; `min_BFS_dist` = closest BFS distance to any exit reached this episode
`TimeBonus`	+0.05 × remaining_steps	Agent evacuated
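
As a worked example, the episode-end rows above can be combined into one function. This is a sketch of the arithmetic only, not the rubric classes in rubrics.py; in particular, the flat −5.0 timeout when no exit was reachable is an assumption read from the stated range, and the function name is hypothetical.

```python
def terminal_reward(evacuated, hp, remaining_steps=0, min_bfs_dist=0, exits_reachable=True):
    """Sum of the episode-end rubric values for one finished episode."""
    if evacuated:
        # SelfSurviveBonus + HealthSurvivalBonus + TimeBonus
        return 5.0 + 1.5 * (hp / 100) + 0.05 * remaining_steps
    if hp <= 0:
        # SelfDeathPenalty + NearMissBonus
        return -10.0 + max(0.0, 3.0 - 0.5 * min_bfs_dist)
    # TimeoutPenalty: alive but out of steps
    if exits_reachable:
        return -5.0 - 3.0 * (hp / 100)
    return -5.0  # assumption: lower bound of the stated range


if __name__ == "__main__":
    print(terminal_reward(True, hp=80, remaining_steps=40))   # evacuated early, mostly healthy
    print(terminal_reward(False, hp=0, min_bfs_dist=2))       # died two cells from an exit
    print(terminal_reward(False, hp=100))                     # timed out at full health
```

Note that timing out at full health costs more than timing out while injured, which discourages a healthy agent from idling when an exit was reachable.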

BFS note: ProgressReward, ProgressRegressionPenalty, SafeProgressBonus, and NearMissBonus all use true BFS traversal distance (walls and obstacles block; closed doors are treated as passable so the reward models optimal reachability assuming doors can be opened). The PPO trainer’s exit “pull” (below) uses Manhattan distance to listed exit cells only for an extra shaping signal — it is not part of the environment rubric.

PPO training script only (`training/ppo/train_torch_ppo.py`)

The server’s step reward above is further adjusted inside the training loop (not returned to HTTP clients): −0.05 on wait; −0.15 after a move if any cardinal neighbor has fire > 0.15; −0.20 if the new position was already in the last 12 positions this episode; + max(0, 0.25 − 0.04 × d) on move when not yet evacuated, where d is Manhattan distance to the nearest cell in map_state.exit_positions.
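
The trainer-side adjustments described above can be written as a single function. This is a sketch under assumptions: the function name and arguments are hypothetical, and applying the revisit penalty only after a move is one reading of "new position"; train_torch_ppo.py is the reference.

```python
def shaping_adjustment(action, neighbor_fire, position, recent_positions,
                       exit_manhattan, evacuated):
    """Extra reward added by the PPO loop on top of the server's step reward."""
    adj = 0.0
    if action == "wait":
        adj -= 0.05
    if action == "move":
        if any(f > 0.15 for f in neighbor_fire):  # fire in any cardinal neighbor
            adj -= 0.15
        if position in recent_positions:          # last 12 positions this episode
            adj -= 0.20
        if not evacuated:                         # Manhattan "pull" toward the nearest exit
            adj += max(0.0, 0.25 - 0.04 * exit_manhattan)
    return adj


if __name__ == "__main__":
    # Clean move to a fresh cell, two cells from the nearest exit.
    print(shaping_adjustment("move", [0.0, 0.0, 0.0, 0.0], (3, 4), [(3, 3)], 2, False))
    # Backtracking next to fire, far from any exit.
    print(shaping_adjustment("move", [0.4, 0.0, 0.0, 0.0], (3, 3), [(3, 3)], 10, False))
```

The exit pull vanishes beyond 6 cells (0.25 − 0.04 × 7 is negative), so far from an exit only the server's BFS-based ProgressReward guides the agent.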

Difficulty presets

Level	Sources	Spread rate	Humidity	Wind	Max steps	Map
`easy`	1	10–20%	30–50%	CALM only	200	Fixed 16×16 templates
`medium`	2–4	15–40%	10–45%	Any	150	Fixed 16×16 templates
`hard`	3–5	30–55%	5–20%	Never CALM	100	Procedural 20×24

Health damage rates (applied after fire sim step):

Condition	HP/step
Light smoke (`0.2–0.5`)	0.5
Moderate smoke (`0.5–0.8`)	2.0
Heavy smoke (`≥ 0.8`)	5.0
On fire (`fire ≥ 0.3`)	10.0

Smoke and fire damage stack if both conditions apply.
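
The damage table and the stacking rule translate into a short lookup. The function name is hypothetical; the thresholds and values are those listed above.

```python
def health_damage(smoke, fire):
    """HP lost this step from the agent's cell; smoke and fire damage stack."""
    dmg = 0.0
    if smoke >= 0.8:
        dmg += 5.0    # heavy smoke
    elif smoke >= 0.5:
        dmg += 2.0    # moderate smoke
    elif smoke >= 0.2:
        dmg += 0.5    # light smoke
    if fire >= 0.3:
        dmg += 10.0   # standing in a burning cell
    return dmg


if __name__ == "__main__":
    dmg = health_damage(smoke=0.6, fire=0.4)
    print(dmg)            # moderate smoke plus fire
    print(-0.02 * dmg)    # matching HealthDrainPenalty for the step
```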

HTTP API

The FastAPI server exposes both the standard OpenEnv routes and additional endpoints:

Method	Path	Body	Returns
`GET`	`/health`	—	`{"status": "ok"}`
`POST`	`/reset`	`{"difficulty": "medium", "seed": null}`	`{observation, reward, done, metadata}`
`POST`	`/step`	`{"action": "move", "direction": "north"}`	`{observation, reward, done, metadata}`
`GET`	`/state`	—	Full `PyreState` dump
`GET`	`/scene`	—	Structured scene graph for UI renderers
`GET`	`/`	—	Frontend `index.html`

/scene returns a 5-channel per-cell tensor (cell_type, fire, smoke, is_agent, is_visible) plus structured labels (agent position/health/location, episode params, door registry) — consumed by the React frontend.

Training

Three training surfaces share the same PPO algorithm core from train_torch_ppo.py:

1. In-process (fastest)

python training/ppo/train_torch_ppo.py \
  --episodes 500 \
  --device cuda \
  --difficulty-schedule easy,medium,hard \
  --patience-threshold 0.65 \
  --output artifacts/pyre_ppo.pt

2. HTTP (against live server)

train_torch_ppo_http.py is a thin wrapper; it runs train_torch_ppo.py with --server http://localhost:8000 and passes through any other flags.

# Start server first
uv run server

# Equivalent: python training/ppo/train_torch_ppo.py --server http://localhost:8000 --episodes 300
python training/ppo/train_torch_ppo_http.py --episodes 300

3. Colab notebook (against HF Space)

Open training/ppo/pyre_ppo_training.ipynb or the hosted Colab. The notebook points SERVER_URL at https://krooz-pyre-env.hf.space and trains entirely over HTTP.

Observation encoding

The ObservationEncoder in train_torch_ppo.py encodes each PyreObservation into a 5,790-dim float32 vector (ObservationEncoder.base_dim):

Grid:    24×24 × 10 channels = 5,760
         • 6 one-hot cell type (floor/wall/door_open/door_closed/exit/obstacle)
         • fire intensity [0,1]
         • smoke density  [0,1]
         • visibility mask (1=visible)
         • agent position mask

Scalars: 17 global features
         health, step_progress, fire_spread_rate, humidity,
         agent_x_norm, agent_y_norm, nearest_exit_distance,
         reachable_exit_count, visible_cell_count, fire_sources,
         smoke_severity, alive, evacuated,
         exit_dx_norm, exit_dy_norm, exit_manhattan_norm   ← exit compass (map-agnostic)

One-hots: wind (5) + difficulty (4: easy, medium, hard_fixed, hard) + route hint (4: N/S/W/E) = 13

Total: 5,760 + 17 + 13 = 5,790

With --history-length 4 (default), four frames are stacked: input_dim = 23,160.
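
The dimension arithmetic above can be checked in a few lines. The variable names are illustrative, and the sketch assumes stacked frames are simply concatenated.

```python
GRID_SIDE = 24
CHANNELS = 6 + 4                 # 6 one-hot cell types + fire, smoke, visibility, agent mask
grid_dim = GRID_SIDE * GRID_SIDE * CHANNELS

scalar_dim = 17                  # global features
onehot_dim = 5 + 4 + 4           # wind + difficulty + route hint

base_dim = grid_dim + scalar_dim + onehot_dim
history_length = 4               # --history-length default
input_dim = base_dim * history_length

print(grid_dim, base_dim, input_dim)  # 5760 5790 23160
```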

Network architecture

Input (23,160)
  → LayerNorm → FC(512) → LayerNorm → ReLU
  → FC(256)   → LayerNorm → ReLU
  → FC(128)   → ReLU
       ├── Policy head → FC(37) logits + action mask (−∞ for invalid)
       └── Value head  → FC(1) scalar

The policy uses 37 discrete actions (4 move, 1 wait, 16 door-open, 16 door-close); look is not in the PPO head because the map encoder already carries visibility. Orthogonal init (√2 gain hidden layers, 0.01 policy head). Total parameters: ~12.1M (varies slightly with hidden-sizes).
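
The quoted parameter total can be reproduced from the layer diagram. The count below assumes every FC layer has a bias and every LayerNorm has a learnable scale and shift (the PyTorch defaults); it is an estimate from the diagram, not a readout of the actual checkpoint.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(n):
    """Learnable scale and shift of a LayerNorm over n features."""
    return 2 * n


INPUT_DIM = 23_160  # 4 stacked frames of 5,790 features

layers = [
    ("LayerNorm(input)", layernorm_params(INPUT_DIM)),
    ("FC(512)", linear_params(INPUT_DIM, 512)),
    ("LayerNorm(512)", layernorm_params(512)),
    ("FC(256)", linear_params(512, 256)),
    ("LayerNorm(256)", layernorm_params(256)),
    ("FC(128)", linear_params(256, 128)),
    ("Policy head FC(37)", linear_params(128, 37)),
    ("Value head FC(1)", linear_params(128, 1)),
]

total = sum(count for _, count in layers)
for name, count in layers:
    print(f"{name:<20} {count:>12,}")
print(f"{'total':<20} {total:>12,}  (~{total / 1e6:.1f}M)")
```

Under these assumptions the total is 12,075,414, and about 98% of it sits in the first FC layer, so the parameter count is driven almost entirely by the stacked grid input.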

Default PPO / curriculum flags

--difficulty-schedule default: easy,medium,hard_fixed,hard (four stages: easy, medium, and two hard modes).
Patience (defaults): threshold 0.65, window 15 episodes, optional --hard-mix-ratio / --hard-mix-dist during the hard stage.

Curriculum

PatienceCurriculum gates stage advancement as follows:

Stay on current stage until:
  success_rate (last 30 eps) ≥ patience_threshold (default 0.65)
  for patience_window (default 15) consecutive episodes
→ then advance to the next stage in --difficulty-schedule

Final stage: optional replay of the previous stage (--hard-mix-ratio, default 0.25)
             or a custom distribution (--hard-mix-dist), to limit forgetting

With the default schedule easy,medium,hard_fixed,hard, the “replay” stage during hard is typically hard_fixed, not medium. The pyre_ppo_hard_v2 run used an explicit hard:…,medium:…,easy:… mix instead; see training/push_to_hub.py for the exact flags.
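
The gating rule can be sketched as a small state machine. PatienceGate is a hypothetical stand-in for PatienceCurriculum: computing the rate over however many episodes are available (up to 30) and resetting the history on advancement are assumptions, and the final-stage replay mix is left out.

```python
from collections import deque


class PatienceGate:
    """Advance through a difficulty schedule once success stays above a threshold."""

    def __init__(self, schedule, threshold=0.65, window=15, rate_span=30):
        self.schedule = list(schedule)
        self.stage = 0
        self.threshold = threshold
        self.window = window
        self.results = deque(maxlen=rate_span)  # rolling success history
        self.streak = 0                         # consecutive episodes above threshold

    @property
    def difficulty(self):
        return self.schedule[self.stage]

    def record(self, success):
        """Record one episode outcome and return the difficulty for the next episode."""
        self.results.append(1.0 if success else 0.0)
        rate = sum(self.results) / len(self.results)
        self.streak = self.streak + 1 if rate >= self.threshold else 0
        if self.streak >= self.window and self.stage < len(self.schedule) - 1:
            self.stage += 1
            self.results.clear()
            self.streak = 0
        return self.difficulty


if __name__ == "__main__":
    gate = PatienceGate(["easy", "medium", "hard_fixed", "hard"])
    for episode in range(1, 31):
        level = gate.record(success=True)
        if episode in (14, 15, 30):
            print(episode, level)
```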

Push to HuggingFace Hub

export HF_TOKEN=hf_...
uv run python training/push_to_hub.py \
  --repo-id Krooz/pyre-ppo-agent \
  --stem pyre_ppo_hard_v2 \
  --artifacts-dir artifacts

Uploads {stem}.pt, {stem}.csv, {stem}.png, {stem}_eval.csv, and generates a model card README. The script’s embedded summary targets the pyre_ppo_hard_v2 HTTP run; adjust --stem if you use a different checkpoint prefix. Trained weights: Krooz/pyre-ppo-agent.

Training results

Primary run on record (artifacts/pyre_ppo_hard_v2.*): 600 HTTP episodes with a patience-gated easy → medium → hard schedule, evaluated every 25 episodes on hard (see training/push_to_hub.py for the exact CLI, metrics table, and hub model-card text). Representative headline numbers from that run: ~55% final training success rate (MA-20, graph title), ~52.7% overall evacuation over all 600 episodes, and ~10.5% evacuation on hard episodes within the run. The agent improves on easy and medium but still struggles on fully procedural hard maps.

Earlier ablation (200 episodes, easy → medium only): a previous curve reached ~75% success on medium after 200 episodes (no hard in the training mix). That artifact set is no longer in the tree; the figure and CSV above supersede it for the hackathon write-up.

Frontend

A cinematic real-time visualization built in React 19 + Vite + TypeScript.

cd frontend
npm install
npm run dev     # → http://localhost:5173

The map renderer (src/components/Map2D.tsx) uses HTML5 Canvas 2D with:

5-layer volumetric fire: dark-red base → orange body → yellow core → white-hot tip → wind-bent plume
Ember particle system: 200-max particles, wind-biased velocity, fade-out
Animated walls: brick texture with heat-tint shift and crack lines near fire
Charred obstacles: dark rubble cells with ember-glow when adjacent to fire
Fog-of-war: per-cell alpha overlay; fire beacon glow punches through fog
Minecraft-style agent: pixel-art character with health-based color theme (blue→orange→red→purple), gold health arc ring, and movement trail

The agent color changes with HP: healthy (≥60%) → blue, moderate (30–59%) → orange, low (1–29%) → red, critical (≤0%) → purple.

The right side panel polls /scene every 500ms and shows tactical controls, per-door state (open/closed/failed), agent biometrics, environment stats, event log with reward annotations, and raw network activity.

Deployment

openenv push --repo-id your-org/pyre-env

The openenv.yaml manifest declares this as a FastAPI space on port 8000. Docker configuration is in server/Dockerfile.

Roadmap

The current architecture — cellular automaton physics, composable rubrics, BFS-based visibility, narrative observation layer, dual LLM+RL interface — is designed to generalise. Planned extensions:

Other natural disasters

The FireSim is one implementation of a physics layer. The same environment shell supports alternative calamity models with minimal changes:

Disaster	Physics swap	New mechanic
Flood	Water pressure + rising level grid	Agent must find high ground or exits before water fills corridors
Earthquake	Probabilistic wall collapse	Rubble blocks form during episode; structural integrity per cell
Chemical spill	Wind-borne toxin concentration	Invisible hazard; agent must infer spread direction from health decay
Wildfire (ground level)	Existing fire sim, outdoor map	No walls, wind-dominated spread, sparse exits

Each shares the same reward rubric composability, observation layer, and training stack.

NPC characters

The floor plan templates already define spawn_zones and the state model has placeholders for multi-agent positions. Next steps:

Add panicking civilians who move randomly and block corridors
Rescue mechanic: escort NPCs to exits for bonus reward
Theory-of-mind challenge: agent must model NPC movement to plan around them
Competing agent: second RL agent racing for the same exit (mixed cooperative/competitive)

3D maps and multi-floor buildings

Stack floor levels connected by staircases
Fire spreads both horizontally and vertically through floor openings
3D BFS cone-of-vision observation (currently 2D flood-fill)
Elevator shafts as high-risk shortcuts
Procedural multi-floor generator extending the existing Prim-MST approach

LLM fine-tuning (GRPO)

training/ already scaffolds GRPO infrastructure alongside PPO
Fine-tune a language model's policy directly on Pyre episode rollouts
Compare: PPO on structured grid vs GRPO on text narrative — does the LLM develop genuine spatial reasoning or pattern-match the narrative?

Harder curriculum stages

extreme difficulty: procedural map, 5 fire sources, humidity 0–5%, always hurricane-force wind, 75 max steps
Dynamic difficulty adjustment: real-time difficulty scaling based on agent rolling success rate
Adversarial fire placement: second agent controls fire source positions to maximise agent failure

Hackathon alignment

Theme #2 — Long-Horizon Planning: 50–200 step episodes; agent must build a mental map across many partial observations with no global state
Theme #3.1 — World Modeling: no global map; agent infers fire spread direction, corridor topology, and exit reachability from local first-person text observations alone