@@ -11,55 +11,139 @@ license: mit
 # Pyre PPO Agent — `Krooz/pyre-ppo-agent`
 PPO-trained actor-critic agent for the [Pyre](https://huggingface.co/spaces/Krooz/pyre_env)
-fire-evacuation environment, part of the OpenEnv Hackathon (Apr 2026).
-## Training summary
 | Metric | Value |
 |--------|-------|
-| Total episodes | ? |
-| Training time | ? min |
-| Final success rate (last 30 ep) | ? |
-| Final reward mean (last 30 ep) | ? |
-| Curriculum | `?` (patience-gated) |
-| Patience threshold | ? |
-## Hyperparameters
 | Param | Value |
 |-------|-------|
-| Learning rate | `?` |
-| PPO clip ε | `?` |
-| Entropy coeff | `?` |
-| Gamma | `?` |
-| Frame stack | `?` |
-| Hidden sizes | `?` |
-| Device | `?` |
 ## Files in this repository
 | File | Description |
 |------|-------------|
-| `pyre_ppo.pt` | PyTorch checkpoint (`network_state`, `optimizer_state`, `config`) |
-| `pyre_ppo.png` | Training graph — reward + success rate over episodes |
-| `pyre_ppo.csv` | Per-episode metrics |
-| `pyre_ppo_eval.csv` | Per-difficulty evaluation metrics |
-| `pyre_ppo_training.log` | Structured JSON-lines training log |
-## Loading the checkpoint
 ```python
 import torch
-ckpt = torch.load("pyre_ppo.pt", map_location="cpu", weights_only=False)
-# ckpt keys: network_state, optimizer_state, scheduler_state, episode, config
-print(ckpt["config"])   # input_dim, action_dim, hidden_sizes, history_length, obs_mode
 ```
-## Environment
-- **Space**: [Krooz/pyre_env](https://huggingface.co/spaces/Krooz/pyre_env)
-- **Training notebook**: [Google Colab](https://colab.research.google.com/drive/1ojC55qKXMVRXdjKeG5dUHiA5RBOBxA9V?usp=sharing)
-- **Source**: [pyre_env/training/ppo/train_torch_ppo.py](training/ppo/train_torch_ppo.py)

 # Pyre PPO Agent — `Krooz/pyre-ppo-agent`
 PPO-trained actor-critic agent for the [Pyre](https://huggingface.co/spaces/Krooz/pyre_env)
+fire-evacuation environment (OpenEnv Hackathon, Apr 2026).
+> ⚠️ This is a raw PyTorch checkpoint, **not** a `transformers` model.
+> The Hugging Face hosted Inference API cannot run it directly.
+> Use the inference code below to load and run it locally.
+## Training summary (artifact run: ``pyre_ppo_fixed``)
+Values below are from ``artifacts/pyre_ppo_fixed.csv``, ``pyre_ppo_fixed_eval.csv``,
+and ``artifacts/pyre_ppo_fixed_training.log`` (HTTP trainer, env server at ``http://localhost:8000``).
 | Metric | Value |
 |--------|-------|
+| Total episodes | **200** |
+| Wall-clock training time | **~48 s** (~4.2 episodes/s on CPU) |
+| Final success rate (rolling last 30 ep) | **80%** |
+| Final reward mean (rolling last 30 ep) | **+8.446** |
+| Curriculum | **Static** ``easy,medium`` (≈100 episodes each; ``--patience-threshold 0``) |
+| Eval cadence | Every **20** episodes, **3** deterministic rollouts |
+| Eval difficulty | **medium** (per eval log / ``pyre_ppo_fixed_eval.csv``) |
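
The rolling "last 30 episodes" figures in the table above can be reproduced from per-episode outcomes with a fixed-size window. This is a minimal sketch, assuming each episode yields a boolean success flag. The function name `rolling_success_rate` and the toy outcome list are illustrative and are not taken from the trainer or its CSV files.

```python
from collections import deque


def rolling_success_rate(outcomes, window=30):
    """Return the success rate over the last `window` episodes after each episode."""
    recent = deque(maxlen=window)  # the oldest outcome drops out automatically
    rates = []
    for success in outcomes:
        recent.append(1.0 if success else 0.0)
        rates.append(sum(recent) / len(recent))
    return rates


# Toy data (not from the real run): 170 failures, then 24 successes and 6 failures.
outcomes = [False] * 170 + [True] * 24 + [False] * 6
rates = rolling_success_rate(outcomes)
print(f"final rolling success rate: {rates[-1]:.0%}")
```

A rolling window reports recent behaviour only, so the final figure can differ noticeably from the success rate over all 200 episodes.
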
+## Network architecture (from training log)
+| Property | Value |
+|----------|-------|
+| Total parameters | **12,065,650** |
+| Input vector dim | **23,140** (encoder ``base_dim`` 5785 × **4** stacked frames) |
+| Action dim | **41** (4 move + 4 look + 1 wait + 16 door open + 16 door close) |
+| Hidden MLP | **512 → 256 → 128** |
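
The two dimensions in the table follow from simple arithmetic, which the sketch below recomputes. The dictionary is only a convenient way to hold the counts; the order of the action groups inside the policy's output vector is an assumption that the table does not confirm.

```python
# Numbers copied from the architecture table above.
BASE_DIM = 5785        # encoder output size for a single frame
HISTORY_LENGTH = 4     # number of stacked frames

# Action groups as listed in the table; their order in the output vector is assumed.
ACTION_GROUPS = {
    "move": 4,
    "look": 4,
    "wait": 1,
    "door_open": 16,
    "door_close": 16,
}

input_dim = BASE_DIM * HISTORY_LENGTH
action_dim = sum(ACTION_GROUPS.values())

print(input_dim)   # 23140
print(action_dim)  # 41
```
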
+## Hyperparameters (defaults matching this run)
 | Param | Value |
 |-------|-------|
+| Learning rate | **3×10⁻⁴** |
+| PPO clip ε | **0.2** |
+| Entropy coeff | **0.03** |
+| Value coeff | **0.5** |
+| Gamma | **0.99** |
+| GAE λ | **0.95** |
+| PPO update every | **5** episodes |
+| PPO epochs / minibatch | **4** / **256** |
+| Max grad norm | **0.5** |
+| Observation mode | **visible** (partial observability) |
+| Device | **cpu** |
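
To show where gamma, GAE λ, and the clip ε from the table enter a PPO update, here is a plain-Python sketch of generalized advantage estimation and the clipped surrogate for a single sample. It is an illustration of the standard formulas under the listed defaults, not the code in `train_torch_ppo.py`; the function names `compute_gae` and `clipped_surrogate` are hypothetical.

```python
def compute_gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode (plain-Python sketch)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # value after the final step; 0.0 if the episode terminated
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]  # value-function targets
    return advantages, returns


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for a single sample (a quantity to maximise)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


advantages, returns = compute_gae(rewards=[0.0, 1.0], values=[0.0, 0.0])
print(advantages)
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

With ε = 0.2, a probability ratio of 1.5 on a positive advantage is capped at 1.2, which limits how far a single update can move the policy.
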
+### Evaluation checkpoints (from ``pyre_ppo_fixed_eval.csv``)
+| Episode | Difficulty | Success rate | Reward mean | Steps mean |
+|---------|------------|--------------|-------------|------------|
+| 20 | medium | 100% | +15.698 | 7.0 |
+| 40 | medium | 100% | +15.640 | 4.3 |
+| 60 | medium | 100% | +16.887 | 9.0 |
+| 80 | medium | 100% | +15.162 | 10.3 |
+| 100 | medium | 67% | +6.008 | 57.0 |
+| 120 | medium | 67% | +6.401 | 32.7 |
+| 140 | medium | 100% | +16.283 | 6.3 |
+| 160 | medium | 100% | +16.573 | 8.3 |
+| 180 | medium | 100% | +16.397 | 8.0 |
+| 200 | medium | 67% | +6.807 | 14.7 |
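
With 3 rollouts per evaluation, a 67% success rate corresponds to 2 successes out of 3. The sketch below shows one way to flag such checkpoints programmatically. The rows are copied from the table above, but the CSV header names and the helper `flag_regressions` are assumptions; the real `eval_metrics.csv` may use different column names.

```python
import csv
import io

# Rows copied from the evaluation table above; the header names are assumed.
EVAL_CSV = """episode,difficulty,success_rate_pct,reward_mean,steps_mean
20,medium,100,15.698,7.0
40,medium,100,15.640,4.3
60,medium,100,16.887,9.0
80,medium,100,15.162,10.3
100,medium,67,6.008,57.0
120,medium,67,6.401,32.7
140,medium,100,16.283,6.3
160,medium,100,16.573,8.3
180,medium,100,16.397,8.0
200,medium,67,6.807,14.7
"""


def flag_regressions(csv_text, min_success_pct=100.0):
    """Return the episodes whose eval success rate fell below the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        int(row["episode"])
        for row in reader
        if float(row["success_rate_pct"]) < min_success_pct
    ]


rows = list(csv.DictReader(io.StringIO(EVAL_CSV)))
mean_reward = sum(float(r["reward_mean"]) for r in rows) / len(rows)

print(flag_regressions(EVAL_CSV))
print(round(mean_reward, 3))
```
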
 ## Files in this repository
 | File | Description |
 |------|-------------|
+| `model.pt` | PyTorch checkpoint (`network_state`, `optimizer_state`, `scheduler_state`, `args`, `episode`) |
+| `training_graph.png` | Training curves (reward + success rate vs episode) |
+| `episode_metrics.csv` | Per-episode training metrics |
+| `eval_metrics.csv` | Periodic eval aggregates |
+| `training.log` | Full console transcript of the HTTP training run |
+## Running inference locally
 ```python
+import sys
 import torch
+from huggingface_hub import hf_hub_download
+# 1. Point Python at your local pyre_env checkout (or install the package)
+sys.path.insert(0, "pyre_env")
+from training.ppo.train_torch_ppo import (
+    ActorCritic,
+    ObservationEncoder,
+    action_index_to_env_action,
+    build_action_mask,
+)
+# 2. Download the checkpoint from this Hub repo
+ckpt_path = hf_hub_download(repo_id="Krooz/pyre-ppo-agent", filename="model.pt")
+# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
+ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+# 3. Rebuild the policy from saved training args
+saved_args = ckpt["args"]
+encoder = ObservationEncoder(mode=saved_args.get("observation_mode", "visible"))
+hidden_sizes = tuple(int(x) for x in saved_args.get("hidden_sizes", "512,256,128").split(","))
+history_length = saved_args.get("history_length", 4)
+input_dim = encoder.base_dim * history_length
+action_dim = 41  # 4 move + 4 look + 1 wait + 16 door open + 16 door close
+network = ActorCritic(input_dim, action_dim, hidden_sizes)
+network.load_state_dict(ckpt["network_state"])
+network.eval()
+print(f"Loaded checkpoint from episode {ckpt.get('episode', '?')}")
+# 4. Roll out one episode (in-process env; swap in an HTTP client if you prefer)
+from collections import deque
+import numpy as np
+from openenv_pyre import PyreEnvironment
+env = PyreEnvironment()
+obs = env.reset(difficulty="medium")
+# Pad the history with zero frames so the first stacked input already has history_length frames
+frames = deque([np.zeros(encoder.base_dim, dtype=np.float32)] * history_length, maxlen=history_length)
+frames.append(encoder.encode(obs))
+total_reward = 0.0
+with torch.no_grad():
+    while True:
+        state_vec = np.concatenate(list(frames), dtype=np.float32)
+        obs_t = torch.tensor(state_vec, dtype=torch.float32).unsqueeze(0)
+        mask_t = torch.tensor(build_action_mask(obs, exclude_look=True), dtype=torch.float32).unsqueeze(0)
+        action_t, _, _ = network.act(obs_t, mask_t, deterministic=True)
+        obs = env.step(action_index_to_env_action(int(action_t.item())))
+        total_reward += float(obs.reward or 0.0)
+        frames.append(encoder.encode(obs))
+        if obs.done:
+            break
+print(f"Episode finished — evacuated={obs.agent_evacuated}  reward={total_reward:.3f}")
 ```
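
The frame-stacking step in the loop above is easier to follow on toy data. The sketch below uses plain lists and a 3-element frame in place of the real 5785-element encoder output; `stack_frames` is a hypothetical helper and is not part of `train_torch_ppo.py`.

```python
from collections import deque

BASE_DIM = 3         # toy frame size; the real encoder reports base_dim = 5785
HISTORY_LENGTH = 4   # same stacking depth as the run above


def stack_frames(frames):
    """Flatten the frame history (oldest first) into one policy input vector."""
    return [x for frame in frames for x in frame]


# Start with zero padding so the first input already has HISTORY_LENGTH frames.
frames = deque(
    [[0.0] * BASE_DIM for _ in range(HISTORY_LENGTH)],
    maxlen=HISTORY_LENGTH,
)

frames.append([1.0, 1.0, 1.0])   # first observation pushes out one zero frame
first_input = stack_frames(frames)

frames.append([2.0, 2.0, 2.0])   # second observation pushes out another
second_input = stack_frames(frames)

print(first_input)
print(second_input)
```

Because the deque has a fixed `maxlen`, every `append` discards the oldest frame, so the input vector always has `BASE_DIM * HISTORY_LENGTH` entries with the newest observation at the end.
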
+## Environment & training resources
+- **HF Space (live env)**: [Krooz/pyre_env](https://huggingface.co/spaces/Krooz/pyre_env)
+- **PPO training in Colab (HTTP to Space)**: [Pyre PPO training — Google Colab](https://colab.research.google.com/drive/1ojC55qKXMVRXdjKeG5dUHiA5RBOBxA9V?usp=sharing)
+- **Local HTTP trainer**: ``training/ppo/train_torch_ppo_http.py``
+- **Local in-process trainer**: ``training/ppo/train_torch_ppo.py``
+- **Notebook source**: ``training/ppo/pyre_ppo_training.ipynb``