---
tags:
- openenv
- reinforcement-learning
- ppo
- pyre
- fire-evacuation
license: mit
---
# Pyre PPO Agent: `krooz/pyre-ppo-agent`
PPO-trained actor-critic agent for the [Pyre](https://huggingface.co/spaces/Krooz/pyre_env)
fire-evacuation environment (OpenEnv Hackathon, Apr 2026).
> ⚠️ This is a raw PyTorch checkpoint, **not** a `transformers` model.
> The Hugging Face hosted Inference API cannot run it directly.
> Use the inference code below to load and run it locally.
## Training summary (artifact run: ``pyre_ppo_hard_v2``)
Values below are from ``artifacts/pyre_ppo_hard_v2.csv``, ``pyre_ppo_hard_v2_eval.csv``,
``pyre_ppo_hard_v2.png`` (MA-**20** curves match ``save_training_graph_png`` in ``train_torch_ppo.py``),
and ``artifacts/pyre_ppo_hard_v2_training.log`` (HTTP trainer via ``train_torch_ppo_http.py``, env at ``http://localhost:8000``).
| Metric | Value |
|--------|-------|
| Total episodes | **600** |
| Wall-clock training time | **~227 s** (~2.6 eps/s) |
| Final success rate (MA-20, training graph title) | **55%** |
| Final reward mean (MA-20) | **+3.21** |
| Final success rate (rolling last 30 ep, CSV ``s30`` / log) | **47%** |
| Overall evacuation rate (all 600 ep, CSV) | **52.7%** |
| Per-difficulty evacuation (easy / medium / hard) | **67.7%** / **59.5%** / **10.5%** |
| Curriculum | **easy → medium → hard** with patience gate (**0.70** over **20** ep); hard-phase mix **hard:0.4, medium:0.4, easy:0.2** |
| Eval cadence | Every **25** episodes, **5** deterministic rollouts |
| Eval difficulty | **hard** (``pyre_ppo_hard_v2_eval.csv``) |
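
The MA-20 and rolling-30 success rates in the table differ mainly because they average over different windows. A minimal sketch of a trailing window mean follows; the exact windowing used by `save_training_graph_png` and the `s30` column is an assumption here, and the sample outcomes are hypothetical.

```python
from collections import deque


def rolling_mean(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest value, evicted by the append below
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Hypothetical per-episode outcomes: 1 = evacuated, 0 = failed.
outcomes = [0, 1, 1, 0, 1, 1]
ma3 = rolling_mean(outcomes, window=3)
print([round(x, 2) for x in ma3])
```

With `window=20` or `window=30` applied to the per-episode success flags, the last element of the result would correspond to the "final" values reported above, under the trailing-window assumption.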
### Training command (this run)
```bash
uv run python training/ppo/train_torch_ppo_http.py \
--episodes 600 \
--difficulty-schedule easy,medium,hard \
--patience-threshold 0.70 \
--patience-window 20 \
--hard-mix-dist hard:0.4,medium:0.4,easy:0.2 \
--update-every 8 \
--update-epochs 6 \
--eval-every 25 \
--eval-difficulty hard \
--eval-episodes 5 \
--checkpoint-every 50 \
--entropy-coef 0.05 \
--step-delay 0 \
--viz-after-ep 500 \
--output artifacts/pyre_ppo_hard_v2.pt \
--log-file artifacts/pyre_ppo_hard_v2_training.log
```
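
The `--patience-threshold`, `--patience-window`, and `--hard-mix-dist` flags describe a patience-gated curriculum. The sketch below shows one way such a gate could work; the class name, the choice to clear the window on promotion, and the sampling details are assumptions, not the trainer's actual code.

```python
import random
from collections import deque

SCHEDULE = ["easy", "medium", "hard"]
HARD_MIX = {"hard": 0.4, "medium": 0.4, "easy": 0.2}


class PatienceCurriculum:
    """Hypothetical gate: promote once the windowed success rate is high enough."""

    def __init__(self, threshold=0.70, window=20, seed=0):
        self.threshold = threshold
        self.recent = deque(maxlen=window)
        self.stage = 0
        self.rng = random.Random(seed)

    def next_difficulty(self):
        # Before the final stage, train only on the current difficulty.
        if SCHEDULE[self.stage] != "hard":
            return SCHEDULE[self.stage]
        # In the hard phase, sample from the mix to limit forgetting.
        names = list(HARD_MIX)
        return self.rng.choices(names, weights=[HARD_MIX[n] for n in names])[0]

    def record(self, success):
        self.recent.append(1.0 if success else 0.0)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and self.stage < len(SCHEDULE) - 1:
            if sum(self.recent) / len(self.recent) >= self.threshold:
                self.stage += 1
                self.recent.clear()  # assumption: start a fresh window per stage


cur = PatienceCurriculum()
for _ in range(20):
    cur.record(True)
print(cur.next_difficulty())  # medium
```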
## Network architecture (from training log)
| Property | Value |
|----------|-------|
| Total parameters | **12,065,650** |
| Input vector dim | **23,140** (encoder ``base_dim`` 5785 × **4** stacked frames) |
| Action dim | **41** (4 move + 4 look + 1 wait + 16 door open + 16 door close) |
| Hidden MLP | **512 → 256 → 128** |
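
The input and action dimensions follow from simple arithmetic, sketched below. The group sizes come from the table; the order of the groups within the 41-way action index is an assumption made only for illustration.

```python
BASE_DIM = 5785        # encoder base_dim, from the table
HISTORY_LENGTH = 4     # stacked frames
INPUT_DIM = BASE_DIM * HISTORY_LENGTH

# Group sizes from the table. The index order is an assumption; check
# action_index_to_env_action in the trainer for the real layout.
ACTION_GROUPS = [
    ("move", 4),
    ("look", 4),
    ("wait", 1),
    ("door_open", 16),
    ("door_close", 16),
]


def action_ranges(groups):
    """Map each group name to a half-open [start, stop) index range."""
    ranges, start = {}, 0
    for name, size in groups:
        ranges[name] = (start, start + size)
        start += size
    return ranges


RANGES = action_ranges(ACTION_GROUPS)
ACTION_DIM = RANGES["door_close"][1]
print(INPUT_DIM, ACTION_DIM)  # 23140 41
```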
## Hyperparameters (this run)
| Param | Value |
|-------|-------|
| Learning rate | **3×10⁻⁴** (with LR decay toward **0.1×** end factor unless disabled) |
| PPO clip ε | **0.2** |
| Entropy coeff | **0.05** |
| Value coeff | **0.5** |
| Gamma | **0.99** |
| GAE λ | **0.95** |
| PPO update every | **8** episodes |
| PPO epochs / minibatch | **6** / **256** |
| Max grad norm | **0.5** |
| Observation mode | **visible** (partial observability) |
| Device | **cuda** (``train_torch_ppo.py`` default; set ``--device cpu`` if needed) |
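
For readers unfamiliar with how these values enter PPO, here is a minimal pure-Python sketch of GAE and the clipped objective in their textbook form. It is not taken from `train_torch_ppo.py`; details such as advantage normalization or bootstrapping at truncation are not stated in this card and are left out.

```python
import math


def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def ppo_loss(surrogate, value_loss, entropy, value_coef=0.5, entropy_coef=0.05):
    """Scalar loss to minimize, using the coefficients from the table."""
    return -surrogate + value_coef * value_loss - entropy_coef * entropy


adv = gae_advantages([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0)
print(adv)  # [2.0, 1.0]
```

The relatively high entropy coefficient (0.05) pushes the policy to keep exploring, which may help explain the noisy hard-difficulty eval results, though this card does not include an ablation to confirm that.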
### Periodic eval on **hard** (from ``pyre_ppo_hard_v2_eval.csv``)
| Episode | Difficulty | Success rate | Reward mean | Steps mean |
|---------|------------|--------------|-------------|------------|
| 25 | hard | 0% | −10.124 | 58.0 |
| 50 | hard | 0% | −11.184 | 58.4 |
| 75 | hard | 0% | −11.468 | 35.6 |
| 100 | hard | 0% | −9.827 | 74.0 |
| 125 | hard | 20% | −7.792 | 25.0 |
| 150 | hard | 40% | −4.238 | 28.0 |
| 175 | hard | 20% | −6.674 | 35.2 |
| 200 | hard | 0% | −12.304 | 74.6 |
| 225 | hard | 0% | −11.080 | 100.0 |
| 250 | hard | 20% | −5.648 | 38.4 |
| 275 | hard | 0% | −10.368 | 76.2 |
| 300 | hard | 20% | −4.421 | 72.8 |
| 325 | hard | 0% | −11.180 | 48.2 |
| 350 | hard | 0% | −9.845 | 74.0 |
| 375 | hard | 0% | −11.320 | 26.4 |
| 400 | hard | 0% | −12.256 | 34.0 |
| 425 | hard | 20% | −7.024 | 36.4 |
| 450 | hard | 0% | −10.726 | 56.4 |
| 475 | hard | 0% | −9.072 | 88.6 |
| 500 | hard | 0% | −12.050 | 66.6 |
| 525 | hard | 20% | −5.528 | 41.6 |
| 550 | hard | 0% | −11.274 | 52.4 |
| 575 | hard | 0% | −10.578 | 58.4 |
| 600 | hard | 0% | −12.068 | 36.6 |
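
Because each eval point uses only 5 rollouts, success rates move in 20% steps and single checkpoints are noisy. The short sketch below aggregates the success column of this table (values copied from the rows above) to find the best eval point and the pooled success rate.

```python
# Success rates (%) copied from the eval table, episodes 25..600 in steps of 25.
EVAL_EPISODES = list(range(25, 601, 25))
SUCCESS_PCT = [0, 0, 0, 0, 20, 40, 20, 0, 0, 20, 0, 20,
               0, 0, 0, 0, 20, 0, 0, 0, 20, 0, 0, 0]
ROLLOUTS_PER_EVAL = 5

assert len(EVAL_EPISODES) == len(SUCCESS_PCT)

best_idx = max(range(len(SUCCESS_PCT)), key=SUCCESS_PCT.__getitem__)
best_episode = EVAL_EPISODES[best_idx]
total_successes = sum(p * ROLLOUTS_PER_EVAL // 100 for p in SUCCESS_PCT)
total_rollouts = ROLLOUTS_PER_EVAL * len(SUCCESS_PCT)
pooled_pct = 100.0 * total_successes / total_rollouts

print(best_episode, total_successes, total_rollouts, round(pooled_pct, 1))
# 150 8 120 6.7
```

Pooled over all 24 eval points, that is 8 successes in 120 deterministic hard rollouts.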
## Files in this repository
| File | Description |
|------|-------------|
| `model.pt` | PyTorch checkpoint (`network_state`, `optimizer_state`, `scheduler_state`, `args`, `episode`) |
| `training_graph.png` | Training curves (reward + success rate vs episode) |
| `episode_metrics.csv` | Per-episode training metrics |
| `eval_metrics.csv` | Periodic eval aggregates |
| `training.log` | Full console transcript of the HTTP training run |
## Running inference locally
```python
import sys
from collections import deque

import numpy as np
import torch
from huggingface_hub import hf_hub_download

# 1. Point Python at your local pyre_env checkout (or install the package)
sys.path.insert(0, "pyre_env")
from training.ppo.train_torch_ppo import (
    ActorCritic,
    ObservationEncoder,
    action_index_to_env_action,
    build_action_mask,
)

# 2. Download the checkpoint from this Hub repo
ckpt_path = hf_hub_download(repo_id="krooz/pyre-ppo-agent", filename="model.pt")
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# 3. Rebuild the policy from saved training args
saved_args = ckpt["args"]
encoder = ObservationEncoder(mode=saved_args.get("observation_mode", "visible"))
hidden_sizes = tuple(int(x) for x in saved_args.get("hidden_sizes", "512,256,128").split(","))
history_length = saved_args.get("history_length", 4)
input_dim = encoder.base_dim * history_length
network = ActorCritic(input_dim, 41, hidden_sizes)
network.load_state_dict(ckpt["network_state"])
network.eval()
print(f"Loaded checkpoint from episode {ckpt.get('episode', '?')}")

# 4. Roll out one episode (in-process env; swap for the HTTP client if you prefer)
from openenv_pyre import PyreEnvironment

env = PyreEnvironment()
obs = env.reset(difficulty="medium")
# Frame stack starts as all zeros, then receives the first real observation.
frames = deque([np.zeros(encoder.base_dim, dtype=np.float32)] * history_length, maxlen=history_length)
frames.append(encoder.encode(obs))
total_reward = 0.0
with torch.no_grad():
    while True:
        state_vec = np.concatenate(list(frames), dtype=np.float32)
        obs_t = torch.tensor(state_vec, dtype=torch.float32).unsqueeze(0)
        mask_t = torch.tensor(build_action_mask(obs, exclude_look=True), dtype=torch.float32).unsqueeze(0)
        action_t, _, _ = network.act(obs_t, mask_t, deterministic=True)
        obs = env.step(action_index_to_env_action(int(action_t.item())))
        total_reward += float(obs.reward or 0.0)
        frames.append(encoder.encode(obs))
        if obs.done:
            break
print(f"Episode finished: evacuated={obs.agent_evacuated} reward={total_reward:.3f}")
```
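
The rollout above passes an action mask and `deterministic=True` to `network.act`. How `act` applies the mask internally is not shown in this card; a common approach, sketched here in plain Python with a hypothetical helper name, is to take the highest-scoring action among those the mask allows.

```python
def masked_greedy_action(logits, mask):
    """Index of the largest logit among actions with a nonzero mask entry."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    allowed = [i for i, m in enumerate(mask) if m]
    if not allowed:
        raise ValueError("mask disables every action")
    return max(allowed, key=lambda i: logits[i])


# Action 0 has the highest logit but is masked out, so action 2 is chosen.
print(masked_greedy_action([5.0, 1.0, 3.0], [0, 1, 1]))  # 2
```

Masking before the argmax keeps the agent from selecting actions that are invalid in the current state, such as operating a door that is out of reach.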
## Environment & training resources
- **HF Space (live env)**: [Krooz/pyre_env](https://huggingface.co/spaces/Krooz/pyre_env)
- **PPO training in Colab (HTTP to Space)**: [Pyre PPO training β Google Colab](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing)
- **Local HTTP trainer**: ``training/ppo/train_torch_ppo_http.py``
- **Local in-process trainer**: ``training/ppo/train_torch_ppo.py``
- **Notebook source**: ``training/ppo/pyre_ppo_training.ipynb``