---
tags:
- openenv
- reinforcement-learning
- ppo
- pyre
- fire-evacuation
license: mit
---

# Pyre PPO Agent — `krooz/pyre-ppo-agent`

PPO-trained actor-critic agent for the [Pyre](https://huggingface.co/spaces/Krooz/pyre_env) fire-evacuation environment (OpenEnv Hackathon, Apr 2026).

> ⚠️ This is a raw PyTorch checkpoint, **not** a `transformers` model.
> The Hugging Face hosted Inference API cannot run it directly.
> Use the inference code below to load and run it locally.

## Training summary (artifact run: ``pyre_ppo_hard_v2``)

Values below come from ``artifacts/pyre_ppo_hard_v2.csv``, ``pyre_ppo_hard_v2_eval.csv``, and ``pyre_ppo_hard_v2.png``; the MA-20 (20-episode moving-average) curves match ``save_training_graph_png`` in ``train_torch_ppo.py``. The full run transcript is ``artifacts/pyre_ppo_hard_v2_training.log`` (HTTP trainer via ``train_torch_ppo_http.py``, env at ``http://localhost:8000``).

| Metric | Value |
|--------|-------|
| Total episodes | **600** |
| Wall-clock training time | **~227 s** (~2.6 eps/s) |
| Final success rate (MA-20, training graph title) | **55%** |
| Final reward mean (MA-20) | **+3.21** |
| Final success rate (rolling last 30 ep, CSV ``s30`` / log) | **47%** |
| Overall evacuation rate (all 600 ep, CSV) | **52.7%** |
| Per-difficulty evacuation (easy / medium / hard) | **67.7%** / **59.5%** / **10.5%** |
| Curriculum | **easy → medium → hard** with patience gate (**0.70** over **20** ep); hard-phase mix **hard:0.4, medium:0.4, easy:0.2** |
| Eval cadence | Every **25** episodes, **5** deterministic rollouts |
| Eval difficulty | **hard** (``pyre_ppo_hard_v2_eval.csv``) |

### Training command (this run)

```bash
uv run python training/ppo/train_torch_ppo_http.py \
  --episodes 600 \
  --difficulty-schedule easy,medium,hard \
  --patience-threshold 0.70 \
  --patience-window 20 \
  --hard-mix-dist hard:0.4,medium:0.4,easy:0.2 \
  --update-every 8 \
  --update-epochs 6 \
  --eval-every 25 \
  --eval-difficulty hard \
  --eval-episodes 5 \
  --checkpoint-every 50 \
  --entropy-coef 0.05 \
  --step-delay 0 \
  --viz-after-ep 500 \
  --output artifacts/pyre_ppo_hard_v2.pt \
  --log-file artifacts/pyre_ppo_hard_v2_training.log
```

## Network architecture (from training log)

| Property | Value |
|----------|-------|
| Total parameters | **12,065,650** |
| Input vector dim | **23,140** (encoder ``base_dim`` 5785 × **4** stacked frames) |
| Action dim | **41** (4 move + 4 look + 1 wait + 16 door open + 16 door close) |
| Hidden MLP | **512 → 256 → 128** |

## Hyperparameters (this run)

| Param | Value |
|-------|-------|
| Learning rate | **3×10⁻⁴** (with LR decay toward a **0.1×** end factor unless disabled) |
| PPO clip ε | **0.2** |
| Entropy coeff | **0.05** |
| Value coeff | **0.5** |
| Gamma | **0.99** |
| GAE λ | **0.95** |
| PPO update every | **8** episodes |
| PPO epochs / minibatch | **6** / **256** |
| Max grad norm | **0.5** |
| Observation mode | **visible** (partial observability) |
| Device | **cuda** (``train_torch_ppo.py`` default; set ``--device cpu`` if needed) |
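These coefficients plug into the standard clipped PPO objective with GAE. For orientation, here is a minimal generic sketch of the update math they parameterize; the function names, shapes, and smoke-test data are illustrative, **not** the repo's actual API (the real implementation lives in ``training/ppo/train_torch_ppo.py``):

```python
import torch
import torch.nn.functional as F

# Coefficients from the hyperparameter table above.
CLIP_EPS, ENT_COEF, VF_COEF = 0.2, 0.05, 0.5
GAMMA, LAM, MAX_GRAD_NORM = 0.99, 0.95, 0.5

def gae_advantages(rewards, values, dones, gamma=GAMMA, lam=LAM):
    """Generalized Advantage Estimation over one rollout of 1-D tensors.

    Bootstraps with V(s_{t+1}) inside the rollout and 0 past its end;
    `dones[t] == 1` cuts the recursion at episode boundaries.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

def ppo_loss(new_logp, old_logp, advantages, new_values, returns, entropy):
    """Clipped surrogate + value + entropy terms of the PPO objective."""
    ratio = (new_logp - old_logp).exp()
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = F.mse_loss(new_values, returns)
    return policy_loss + VF_COEF * value_loss - ENT_COEF * entropy.mean()

# Smoke test on random data (5-step rollout, purely illustrative).
rewards, values, dones = torch.randn(5), torch.randn(5), torch.zeros(5)
advantages = gae_advantages(rewards, values, dones)
returns = advantages + values  # value-function regression target
logp = torch.randn(5)
loss = ppo_loss(logp, logp, advantages, values, returns, entropy=torch.ones(5))
```

The max-grad-norm entry in the table corresponds to clipping gradients after each minibatch backward pass, which in stock PyTorch is ``torch.nn.utils.clip_grad_norm_(network.parameters(), MAX_GRAD_NORM)``.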
### Periodic eval on **hard** (from ``pyre_ppo_hard_v2_eval.csv``)

| Episode | Difficulty | Success rate | Reward mean | Steps mean |
|---------|------------|--------------|-------------|------------|
| 25 | hard | 0% | −10.124 | 58.0 |
| 50 | hard | 0% | −11.184 | 58.4 |
| 75 | hard | 0% | −11.468 | 35.6 |
| 100 | hard | 0% | −9.827 | 74.0 |
| 125 | hard | 20% | −7.792 | 25.0 |
| 150 | hard | 40% | −4.238 | 28.0 |
| 175 | hard | 20% | −6.674 | 35.2 |
| 200 | hard | 0% | −12.304 | 74.6 |
| 225 | hard | 0% | −11.080 | 100.0 |
| 250 | hard | 20% | −5.648 | 38.4 |
| 275 | hard | 0% | −10.368 | 76.2 |
| 300 | hard | 20% | −4.421 | 72.8 |
| 325 | hard | 0% | −11.180 | 48.2 |
| 350 | hard | 0% | −9.845 | 74.0 |
| 375 | hard | 0% | −11.320 | 26.4 |
| 400 | hard | 0% | −12.256 | 34.0 |
| 425 | hard | 20% | −7.024 | 36.4 |
| 450 | hard | 0% | −10.726 | 56.4 |
| 475 | hard | 0% | −9.072 | 88.6 |
| 500 | hard | 0% | −12.050 | 66.6 |
| 525 | hard | 20% | −5.528 | 41.6 |
| 550 | hard | 0% | −11.274 | 52.4 |
| 575 | hard | 0% | −10.578 | 58.4 |
| 600 | hard | 0% | −12.068 | 36.6 |

## Files in this repository

| File | Description |
|------|-------------|
| `model.pt` | PyTorch checkpoint (`network_state`, `optimizer_state`, `scheduler_state`, `args`, `episode`) |
| `training_graph.png` | Training curves (reward + success rate vs episode) |
| `episode_metrics.csv` | Per-episode training metrics |
| `eval_metrics.csv` | Periodic eval aggregates |
| `training.log` | Full console transcript of the HTTP training run |

## Running inference locally

```python
import sys
import torch
from huggingface_hub import hf_hub_download

# 1. Point Python at your local pyre_env checkout (or install the package)
sys.path.insert(0, "pyre_env")
from training.ppo.train_torch_ppo import (
    ActorCritic,
    ObservationEncoder,
    action_index_to_env_action,
    build_action_mask,
)

# 2. Download the checkpoint from this Hub repo
ckpt_path = hf_hub_download(repo_id="krooz/pyre-ppo-agent", filename="model.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# 3. Rebuild the policy from saved training args
saved_args = ckpt["args"]
encoder = ObservationEncoder(mode=saved_args.get("observation_mode", "visible"))
hidden_sizes = tuple(int(x) for x in saved_args.get("hidden_sizes", "512,256,128").split(","))
history_length = saved_args.get("history_length", 4)
input_dim = encoder.base_dim * history_length

network = ActorCritic(input_dim, 41, hidden_sizes)
network.load_state_dict(ckpt["network_state"])
network.eval()
print(f"Loaded checkpoint from episode {ckpt.get('episode', '?')}")

# 4. Roll out one episode (in-process env — swap for HTTP client if you prefer)
from openenv_pyre import PyreEnvironment
from collections import deque
import numpy as np

env = PyreEnvironment()
obs = env.reset(difficulty="medium")
frames = deque(
    [np.zeros(encoder.base_dim, dtype=np.float32)] * history_length,
    maxlen=history_length,
)
frames.append(encoder.encode(obs))

total_reward = 0.0
with torch.no_grad():
    while True:
        state_vec = np.concatenate(list(frames), dtype=np.float32)
        obs_t = torch.tensor(state_vec, dtype=torch.float32).unsqueeze(0)
        mask_t = torch.tensor(
            build_action_mask(obs, exclude_look=True), dtype=torch.float32
        ).unsqueeze(0)
        action_t, _, _ = network.act(obs_t, mask_t, deterministic=True)
        obs = env.step(action_index_to_env_action(int(action_t.item())))
        total_reward += float(obs.reward or 0.0)
        frames.append(encoder.encode(obs))
        if obs.done:
            break

print(f"Episode finished — evacuated={obs.agent_evacuated} reward={total_reward:.3f}")
```

## Environment & training resources

- **HF Space (live env)**: [Krooz/pyre_env](https://huggingface.co/spaces/Krooz/pyre_env)
- **PPO training in Colab (HTTP to Space)**: [Pyre PPO training — Google Colab](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing)
- **Local HTTP trainer**: ``training/ppo/train_torch_ppo_http.py``
- **Local in-process trainer**: ``training/ppo/train_torch_ppo.py``
- **Notebook source**: ``training/ppo/pyre_ppo_training.ipynb``
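## Checkpoint sanity check

A quick way to verify a downloaded `model.pt` against the architecture table: count the tensor elements in the saved `network_state` (the checkpoint keys are listed in the files table above). This is a minimal sketch; the count also includes any non-trainable buffers the model registers.

```python
import torch
from huggingface_hub import hf_hub_download

# Download and open the checkpoint without rebuilding the network.
ckpt = torch.load(
    hf_hub_download(repo_id="krooz/pyre-ppo-agent", filename="model.pt"),
    map_location="cpu",
    weights_only=False,
)

# Sum element counts over every tensor in the saved state dict.
total = sum(t.numel() for t in ckpt["network_state"].values())
print(f"{total:,} parameters")  # the training log reports 12,065,650
```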