---
tags:
- openenv
- reinforcement-learning
- ppo
- pyre
- fire-evacuation
license: mit
---
# Pyre PPO Agent: `krooz/pyre-ppo-agent`
PPO-trained actor-critic agent for the [Pyre](https://huggingface.co/spaces/Krooz/pyre_env)
fire-evacuation environment (OpenEnv Hackathon, Apr 2026).
> ⚠️ This is a raw PyTorch checkpoint, **not** a `transformers` model.
> The Hugging Face hosted Inference API cannot run it directly.
> Use the inference code below to load and run it locally.
## Training summary (artifact run: `pyre_ppo_hard_v2`)
Values below come from `artifacts/pyre_ppo_hard_v2.csv`, `pyre_ppo_hard_v2_eval.csv`, and
`pyre_ppo_hard_v2.png` (its MA-20 curves match `save_training_graph_png` in `train_torch_ppo.py`),
plus `artifacts/pyre_ppo_hard_v2_training.log` (HTTP trainer via `train_torch_ppo_http.py`, env at `http://localhost:8000`).
| Metric | Value |
|--------|-------|
| Total episodes | **600** |
| Wall-clock training time | **~227 s** (~2.6 eps/s) |
| Final success rate (MA-20, from the training-graph title) | **55%** |
| Final reward mean (MA-20) | **+3.21** |
| Final success rate (rolling last 30 ep, CSV `s30` / log) | **47%** |
| Overall evacuation rate (all 600 ep, CSV) | **52.7%** |
| Per-difficulty evacuation (easy / medium / hard) | **67.7%** / **59.5%** / **10.5%** |
| Curriculum | **easy → medium → hard** with patience gate (**0.70** over **20** ep; sketched below); hard-phase mix **hard:0.4, medium:0.4, easy:0.2** |
| Eval cadence | Every **25** episodes, **5** deterministic rollouts |
| Eval difficulty | **hard** (`pyre_ppo_hard_v2_eval.csv`) |
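
The patience gate and difficulty mix live in `train_torch_ppo.py`; the trainer's exact code is not reproduced here. As a minimal sketch of the mechanism (the `CurriculumGate` class and its method names are invented for illustration), the gate promotes the difficulty once the rolling success rate clears the threshold, and the final stage samples from a mix rather than hard-only episodes:

```python
import random
from collections import deque

class CurriculumGate:
    """Advance easy -> medium -> hard once the rolling success rate
    clears the patience threshold (0.70 over the last 20 episodes)."""

    def __init__(self, stages=("easy", "medium", "hard"), threshold=0.70,
                 window=20, hard_mix=(("hard", 0.4), ("medium", 0.4), ("easy", 0.2))):
        self.stages = stages
        self.stage_idx = 0
        self.threshold = threshold
        self.successes = deque(maxlen=window)
        self.hard_mix = dict(hard_mix)

    def record(self, evacuated: bool) -> None:
        self.successes.append(1.0 if evacuated else 0.0)
        window_full = len(self.successes) == self.successes.maxlen
        if (window_full and self.stage_idx < len(self.stages) - 1
                and sum(self.successes) / len(self.successes) >= self.threshold):
            self.stage_idx += 1
            self.successes.clear()  # fresh window for the new stage

    def sample_difficulty(self) -> str:
        # Final phase mixes difficulties (hard:0.4, medium:0.4, easy:0.2 in
        # this run) instead of training on hard maps only.
        if self.stages[self.stage_idx] == "hard":
            names, weights = zip(*self.hard_mix.items())
            return random.choices(names, weights=weights, k=1)[0]
        return self.stages[self.stage_idx]
```

Keeping some easy and medium maps in the hard phase is a common guard against forgetting skills learned earlier in the curriculum.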
### Training command (this run)
```bash
uv run python training/ppo/train_torch_ppo_http.py \
--episodes 600 \
--difficulty-schedule easy,medium,hard \
--patience-threshold 0.70 \
--patience-window 20 \
--hard-mix-dist hard:0.4,medium:0.4,easy:0.2 \
--update-every 8 \
--update-epochs 6 \
--eval-every 25 \
--eval-difficulty hard \
--eval-episodes 5 \
--checkpoint-every 50 \
--entropy-coef 0.05 \
--step-delay 0 \
--viz-after-ep 500 \
--output artifacts/pyre_ppo_hard_v2.pt \
--log-file artifacts/pyre_ppo_hard_v2_training.log
```
## Network architecture (from training log)
| Property | Value |
|----------|-------|
| Total parameters | **12,065,650** |
| Input vector dim | **23,140** (encoder `base_dim` 5785 × **4** stacked frames) |
| Action dim | **41** (4 move + 4 look + 1 wait + 16 door open + 16 door close) |
| Hidden MLP | **512 → 256 → 128** |
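
The actual `ActorCritic` class is defined in `training/ppo/train_torch_ppo.py`. Below is a minimal sketch consistent with the table above; the shared-trunk layout, `Tanh` activations, and head shapes are assumptions, and the checkpoint's 12,065,650 parameters suggest the real network carries somewhat more than this bare trunk:

```python
import torch
import torch.nn as nn

class ActorCriticSketch(nn.Module):
    """Shared MLP trunk (512 -> 256 -> 128) with policy and value heads."""

    def __init__(self, input_dim=23_140, action_dim=41, hidden_sizes=(512, 256, 128)):
        super().__init__()
        layers, last = [], input_dim
        for h in hidden_sizes:
            layers += [nn.Linear(last, h), nn.Tanh()]
            last = h
        self.trunk = nn.Sequential(*layers)
        self.policy_head = nn.Linear(last, action_dim)  # logits over the 41 actions
        self.value_head = nn.Linear(last, 1)            # scalar state-value estimate

    def forward(self, obs, action_mask=None):
        x = self.trunk(obs)
        logits = self.policy_head(x)
        if action_mask is not None:
            # Masked (invalid) actions get -inf logits so they are never sampled.
            logits = logits.masked_fill(action_mask == 0, float("-inf"))
        return logits, self.value_head(x).squeeze(-1)
```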
## Hyperparameters (this run)
| Param | Value |
|-------|-------|
| Learning rate | **3×10⁻⁴** (decays toward a **0.1×** end factor unless disabled) |
| PPO clip ε | **0.2** |
| Entropy coeff | **0.05** |
| Value coeff | **0.5** |
| Gamma | **0.99** |
| GAE λ | **0.95** |
| PPO update every | **8** episodes |
| PPO epochs / minibatch | **6** / **256** |
| Max grad norm | **0.5** |
| Observation mode | **visible** (partial observability) |
| Device | **cuda** (`train_torch_ppo.py` default; set `--device cpu` if needed) |
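
These coefficients plug into the standard PPO-Clip objective with GAE(λ) advantages. Here is a generic sketch with this run's values as defaults (textbook PPO, not code lifted from `train_torch_ppo.py`):

```python
import torch
import torch.nn.functional as F

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE(lambda); `values` holds len(rewards) + 1 entries, the last a bootstrap."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0   # cut the return at episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
    return advantages

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.05):
    """Clipped surrogate + value MSE - entropy bonus, per this run's coefficients."""
    ratio = torch.exp(new_logp - old_logp)                    # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    # The 0.05 entropy bonus keeps exploration up on sparse-reward hard maps.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```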
### Periodic eval on **hard** (from `pyre_ppo_hard_v2_eval.csv`)
| Episode | Difficulty | Success rate | Reward mean | Steps mean |
|---------|------------|--------------|-------------|------------|
| 25 | hard | 0% | -10.124 | 58.0 |
| 50 | hard | 0% | -11.184 | 58.4 |
| 75 | hard | 0% | -11.468 | 35.6 |
| 100 | hard | 0% | -9.827 | 74.0 |
| 125 | hard | 20% | -7.792 | 25.0 |
| 150 | hard | 40% | -4.238 | 28.0 |
| 175 | hard | 20% | -6.674 | 35.2 |
| 200 | hard | 0% | -12.304 | 74.6 |
| 225 | hard | 0% | -11.080 | 100.0 |
| 250 | hard | 20% | -5.648 | 38.4 |
| 275 | hard | 0% | -10.368 | 76.2 |
| 300 | hard | 20% | -4.421 | 72.8 |
| 325 | hard | 0% | -11.180 | 48.2 |
| 350 | hard | 0% | -9.845 | 74.0 |
| 375 | hard | 0% | -11.320 | 26.4 |
| 400 | hard | 0% | -12.256 | 34.0 |
| 425 | hard | 20% | -7.024 | 36.4 |
| 450 | hard | 0% | -10.726 | 56.4 |
| 475 | hard | 0% | -9.072 | 88.6 |
| 500 | hard | 0% | -12.050 | 66.6 |
| 525 | hard | 20% | -5.528 | 41.6 |
| 550 | hard | 0% | -11.274 | 52.4 |
| 575 | hard | 0% | -10.578 | 58.4 |
| 600 | hard | 0% | -12.068 | 36.6 |
## Files in this repository
| File | Description |
|------|-------------|
| `model.pt` | PyTorch checkpoint (`network_state`, `optimizer_state`, `scheduler_state`, `args`, `episode`) |
| `training_graph.png` | Training curves (reward + success rate vs episode) |
| `episode_metrics.csv` | Per-episode training metrics |
| `eval_metrics.csv` | Periodic eval aggregates |
| `training.log` | Full console transcript of the HTTP training run |
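
The metric CSVs can be pulled from the Hub and inspected directly; for example, a quick pandas check of per-difficulty evacuation rates (the `difficulty` and `evacuated` column names are assumptions about the CSV schema, so adjust to the actual headers):

```python
import pandas as pd
from huggingface_hub import hf_hub_download

csv_path = hf_hub_download(repo_id="krooz/pyre-ppo-agent", filename="episode_metrics.csv")
df = pd.read_csv(csv_path)

# Assumed schema: one row per training episode with at least a
# `difficulty` column and a boolean/0-1 `evacuated` outcome column.
print(df.groupby("difficulty")["evacuated"].mean())
```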
## Running inference locally
```python
import sys
from collections import deque

import numpy as np
import torch
from huggingface_hub import hf_hub_download

# 1. Point Python at your local pyre_env checkout (or install the package)
sys.path.insert(0, "pyre_env")
from training.ppo.train_torch_ppo import (
    ActorCritic,
    ObservationEncoder,
    action_index_to_env_action,
    build_action_mask,
)

# 2. Download the checkpoint from this Hub repo
ckpt_path = hf_hub_download(repo_id="krooz/pyre-ppo-agent", filename="model.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# 3. Rebuild the policy from the saved training args
saved_args = ckpt["args"]
encoder = ObservationEncoder(mode=saved_args.get("observation_mode", "visible"))
hidden_sizes = tuple(int(x) for x in saved_args.get("hidden_sizes", "512,256,128").split(","))
history_length = saved_args.get("history_length", 4)
input_dim = encoder.base_dim * history_length

network = ActorCritic(input_dim, 41, hidden_sizes)
network.load_state_dict(ckpt["network_state"])
network.eval()
print(f"Loaded checkpoint from episode {ckpt.get('episode', '?')}")

# 4. Roll out one episode (in-process env; swap for the HTTP client if you prefer)
from openenv_pyre import PyreEnvironment

env = PyreEnvironment()
obs = env.reset(difficulty="medium")
frames = deque(
    [np.zeros(encoder.base_dim, dtype=np.float32)] * history_length,
    maxlen=history_length,
)
frames.append(encoder.encode(obs))

total_reward = 0.0
with torch.no_grad():
    while True:
        state_vec = np.concatenate(list(frames), dtype=np.float32)
        obs_t = torch.tensor(state_vec, dtype=torch.float32).unsqueeze(0)
        mask_t = torch.tensor(
            build_action_mask(obs, exclude_look=True), dtype=torch.float32
        ).unsqueeze(0)
        action_t, _, _ = network.act(obs_t, mask_t, deterministic=True)
        obs = env.step(action_index_to_env_action(int(action_t.item())))
        total_reward += float(obs.reward or 0.0)
        frames.append(encoder.encode(obs))
        if obs.done:
            break

print(f"Episode finished: evacuated={obs.agent_evacuated} reward={total_reward:.3f}")
```
## Environment & training resources
- **HF Space (live env)**: [Krooz/pyre_env](https://huggingface.co/spaces/Krooz/pyre_env)
- **PPO training in Colab (HTTP to Space)**: [Pyre PPO training (Google Colab)](https://colab.research.google.com/drive/1JPIajg0BAKEriNAwgGRnN7LXEcyCeiEV?usp=sharing)
- **Local HTTP trainer**: `training/ppo/train_torch_ppo_http.py`
- **Local in-process trainer**: `training/ppo/train_torch_ppo.py`
- **Notebook source**: `training/ppo/pyre_ppo_training.ipynb`