---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---
# **ppo** Agent playing **Pyramids**
This is a trained model of a **ppo** agent playing **Pyramids**
using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).
## Usage (with ML-Agents)
Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/
We wrote a complete tutorial to learn how to train your first agent using ML-Agents and publish it to the Hub:
- A *short tutorial* where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your
browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
- A *longer tutorial* to understand how ML-Agents works:
https://huggingface.co/learn/deep-rl-course/unit5/introduction
### Resume the training
```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```
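For this repository's run, the resume command would look like the following (assuming the same config file, environment executable, and run ID as in the training log further below):
```bash
mlagents-learn ./config/ppo/PyramidsRND.yaml \
  --env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
  --run-id="PyramidsGPUTest" \
  --resume
```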
### Watch your Agent play
You can watch your agent **playing directly in your browser**:
1. If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
2. Find your model_id: jetfan-xin/ppo-Pyramids
3. Select your *.nn /*.onnx file
4. Click on Watch the agent play 👀
# 🧠 PPO Agent Trained on Unity Pyramids Environment
This repository contains a reinforcement learning agent trained using **Proximal Policy Optimization (PPO)** on Unity's **Pyramids** environment via **ML-Agents**.
## 📋 Model Overview
- **Algorithm**: PPO with RND (Random Network Distillation)
- **Environment**: Unity Pyramids (3D sparse-reward maze)
- **Framework**: ML-Agents v1.2.0.dev0
- **Backend**: PyTorch 2.7.1 (CUDA-enabled)
The agent learns to navigate a 3D maze and reach the goal area by combining extrinsic and intrinsic rewards.
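The intrinsic signal comes from RND: a predictor network is trained to match a fixed, randomly initialized target network, and states the predictor reconstructs poorly (i.e., novel states) earn a larger exploration bonus. Below is a minimal PyTorch sketch of that idea, illustrative only and not ML-Agents' internal implementation; `obs_dim` and the layer sizes are placeholder values.
```python
# Minimal RND sketch (illustrative; not ML-Agents' internal code).
import torch
import torch.nn as nn

obs_dim = 172  # placeholder observation size

# Frozen random target network and a trainable predictor.
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 64))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 64))
for p in target.parameters():
    p.requires_grad_(False)  # the target is never trained

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)  # rnd learning_rate from this run

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Prediction error against the frozen target acts as a novelty bonus."""
    with torch.no_grad():
        target_feat = target(obs)
    error = (predictor(obs) - target_feat).pow(2).mean(dim=-1)
    # Training the predictor makes familiar states earn ever-smaller bonuses.
    opt.zero_grad()
    error.mean().backward()
    opt.step()
    return error.detach()

# The trainer then mixes signals by their configured strengths:
#   r_total = 1.0 * r_extrinsic + 0.01 * r_rnd
```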
---
## 🚀 How to Use This Model
You can use the `.onnx` model directly in Unity.
### ✅ Steps
1. **Download the model**
Clone the repository or download `Pyramids.onnx`:
```bash
git lfs install
git clone https://huggingface.co/jetfan-xin/ppo-Pyramids
```
2. **Place in Unity project**
Put the model file in your Unity project under:
```
Assets/ML-Agents/Examples/Pyramids/Pyramids.onnx
```
3. **Assign in Unity Editor**
- Select your agent GameObject.
- In `Behavior Parameters`, assign `Pyramids.onnx` as the model.
- Make sure the Behavior Name matches your training config.
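Outside Unity, you can sanity-check the exported file with `onnxruntime` (an assumption about your tooling; Unity itself loads the model through its own inference backend). A minimal sketch that lists the model's input and output tensors:
```python
# Sketch: inspect the exported ONNX policy outside Unity.
# Assumes onnxruntime is installed (pip install onnxruntime) and that
# Pyramids.onnx is in the current directory.
import onnxruntime as ort

session = ort.InferenceSession("Pyramids.onnx")

# ML-Agents exports observations as inputs and actions as outputs;
# the exact tensor names depend on the ML-Agents version.
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```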
---
## ⚙️ Training Configuration
Key settings from `configuration.yaml`:
- `trainer_type`: `ppo`
- `max_steps`: `1000000`
- `batch_size`: `128`, `buffer_size`: `2048`
- `learning_rate`: `3e-4`
- `reward_signals`:
  - `extrinsic`: γ=0.99, strength=1.0
  - `rnd`: γ=0.99, strength=0.01
- `hidden_units`: `512`, `num_layers`: `2`
- `summary_freq`: `30000`
See `configuration.yaml` for full details.
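For reference, these settings correspond to an ML-Agents config file along the following lines, condensed and reconstructed from the training log below; `configuration.yaml` in this repository is the authoritative version:
```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 30000
```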
---
## 📈 Training Performance
Sample rewards from training log:
| Step | Mean Reward |
|-----------|-------------|
| 300,000 | -0.22 |
| 480,000 | 0.35 |
| 660,000 | 1.14 |
| 840,000 | 1.47 |
| 990,000 | 1.54 |
Full training log:
```
(rl_py310) 4xin@ltgpu3:~/deep_rl/unit5/ml-agents$ CUDA_VISIBLE_DEVICES=3 mlagents-learn ./config/ppo/PyramidsRND.yaml \
--env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
--run-id="PyramidsGPUTest" \
--no-graphics
        [Unity ML-Agents ASCII art banner omitted]
Version information:
ml-agents: 1.2.0.dev0,
ml-agents-envs: 1.2.0.dev0,
Communicator API: 1.5.0,
PyTorch: 2.7.1+cu126
[INFO] Connected to Unity environment with package version 2.2.1-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Pyramids?team=0
[INFO] Hyperparameters for behavior name Pyramids:
trainer_type: ppo
hyperparameters:
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01
epsilon: 0.2
lambd: 0.95
num_epoch: 3
shared_critic: False
learning_rate_schedule: linear
beta_schedule: linear
epsilon_schedule: linear
checkpoint_interval: 500000
network_settings:
normalize: False
hidden_units: 512
num_layers: 2
vis_encode_type: simple
memory: None
goal_conditioning_type: hyper
deterministic: False
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
network_settings:
normalize: False
hidden_units: 128
num_layers: 2
vis_encode_type: simple
memory: None
goal_conditioning_type: hyper
deterministic: False
rnd:
gamma: 0.99
strength: 0.01
network_settings:
normalize: False
hidden_units: 64
num_layers: 3
vis_encode_type: simple
memory: None
goal_conditioning_type: hyper
deterministic: False
learning_rate: 0.0001
encoding_size: None
init_path: None
keep_checkpoints: 5
even_checkpoints: False
max_steps: 1000000
time_horizon: 128
summary_freq: 30000
threaded: False
self_play: None
behavioral_cloning: None
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 60000. Time Elapsed: 90.519 s. Mean Reward: -0.853. Std of Reward: 0.588. Training.
[INFO] Pyramids. Step: 90000. Time Elapsed: 136.319 s. Mean Reward: -0.797. Std of Reward: 0.646. Training.
[INFO] Pyramids. Step: 120000. Time Elapsed: 182.893 s. Mean Reward: -0.831. Std of Reward: 0.654. Training.
[INFO] Pyramids. Step: 150000. Time Elapsed: 227.995 s. Mean Reward: -0.715. Std of Reward: 0.760. Training.
[INFO] Pyramids. Step: 180000. Time Elapsed: 270.527 s. Mean Reward: -0.731. Std of Reward: 0.712. Training.
[INFO] Pyramids. Step: 210000. Time Elapsed: 316.617 s. Mean Reward: -0.699. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 240000. Time Elapsed: 361.434 s. Mean Reward: -0.640. Std of Reward: 0.822. Training.
[INFO] Pyramids. Step: 270000. Time Elapsed: 407.787 s. Mean Reward: -0.520. Std of Reward: 0.969. Training.
[INFO] Pyramids. Step: 300000. Time Elapsed: 451.612 s. Mean Reward: -0.222. Std of Reward: 1.135. Training.
[INFO] Pyramids. Step: 330000. Time Elapsed: 496.996 s. Mean Reward: -0.328. Std of Reward: 1.124. Training.
[INFO] Pyramids. Step: 360000. Time Elapsed: 541.248 s. Mean Reward: -0.452. Std of Reward: 0.995. Training.
[INFO] Pyramids. Step: 390000. Time Elapsed: 587.186 s. Mean Reward: -0.411. Std of Reward: 1.044. Training.
[INFO] Pyramids. Step: 420000. Time Elapsed: 630.923 s. Mean Reward: -0.042. Std of Reward: 1.228. Training.
[INFO] Pyramids. Step: 450000. Time Elapsed: 675.866 s. Mean Reward: 0.009. Std of Reward: 1.237. Training.
[INFO] Pyramids. Step: 480000. Time Elapsed: 721.391 s. Mean Reward: 0.351. Std of Reward: 1.271. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-499992.onnx
[INFO] Pyramids. Step: 510000. Time Elapsed: 767.344 s. Mean Reward: 0.647. Std of Reward: 1.140. Training.
[INFO] Pyramids. Step: 540000. Time Elapsed: 812.656 s. Mean Reward: 0.526. Std of Reward: 1.178. Training.
[INFO] Pyramids. Step: 570000. Time Elapsed: 857.156 s. Mean Reward: 0.525. Std of Reward: 1.236. Training.
[INFO] Pyramids. Step: 600000. Time Elapsed: 900.647 s. Mean Reward: 0.979. Std of Reward: 0.977. Training.
[INFO] Pyramids. Step: 630000. Time Elapsed: 949.947 s. Mean Reward: 1.044. Std of Reward: 1.040. Training.
[INFO] Pyramids. Step: 660000. Time Elapsed: 1006.810 s. Mean Reward: 1.143. Std of Reward: 0.937. Training.
[INFO] Pyramids. Step: 690000. Time Elapsed: 1062.833 s. Mean Reward: 1.151. Std of Reward: 0.997. Training.
[INFO] Pyramids. Step: 720000. Time Elapsed: 1119.948 s. Mean Reward: 1.499. Std of Reward: 0.563. Training.
[INFO] Pyramids. Step: 750000. Time Elapsed: 1178.547 s. Mean Reward: 1.308. Std of Reward: 0.835. Training.
[INFO] Pyramids. Step: 780000. Time Elapsed: 1226.204 s. Mean Reward: 1.278. Std of Reward: 0.866. Training.
[INFO] Pyramids. Step: 810000. Time Elapsed: 1275.499 s. Mean Reward: 1.318. Std of Reward: 0.856. Training.
[INFO] Pyramids. Step: 840000. Time Elapsed: 1322.302 s. Mean Reward: 1.477. Std of Reward: 0.641. Training.
[INFO] Pyramids. Step: 870000. Time Elapsed: 1370.429 s. Mean Reward: 1.367. Std of Reward: 0.816. Training.
[INFO] Pyramids. Step: 900000. Time Elapsed: 1418.228 s. Mean Reward: 1.471. Std of Reward: 0.689. Training.
[INFO] Pyramids. Step: 930000. Time Elapsed: 1465.721 s. Mean Reward: 1.514. Std of Reward: 0.619. Training.
[INFO] Pyramids. Step: 960000. Time Elapsed: 1513.116 s. Mean Reward: 1.403. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-999909.onnx
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx
[INFO] Copied results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx to results/PyramidsGPUTest/Pyramids.onnx.
```
✅ Model exported to `Pyramids.onnx` after reaching max steps.
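To visualize the learning curve, the sampled rewards from the table above can be plotted with a few lines of Python (values transcribed from the log; assumes `matplotlib` is installed):
```python
# Plot the mean-reward samples taken from the training log above.
import matplotlib.pyplot as plt

steps = [300_000, 480_000, 660_000, 840_000, 990_000]
mean_rewards = [-0.22, 0.35, 1.14, 1.47, 1.54]

plt.plot(steps, mean_rewards, marker="o")
plt.xlabel("Training step")
plt.ylabel("Mean reward")
plt.title("PPO + RND on Pyramids (run PyramidsGPUTest)")
plt.grid(True)
plt.show()
```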
---
## 🖥️ Training Setup
- **Run ID**: `PyramidsGPUTest`
- **GPU**: NVIDIA A100 80GB PCIe
- **Training time**: ~26 minutes
- **ML-Agents Envs**: v1.2.0.dev0
- **Communicator API**: v1.5.0
---
## 📁 Repository Contents
| File / Folder | Description |
|------------------------|----------------------------------------------|
| `Pyramids.onnx` | Exported trained PPO agent |
| `configuration.yaml` | Full PPO + RND training config |
| `run_logs/` | Training logs from ML-Agents |
| `Pyramids/` | Environment-specific output folder |
| `config.json` | Metadata for Hugging Face model card |
---
## 📚 Citation
If you use this model, please consider citing:
```
@misc{ppoPyramidsJetfan,
author = {Jingfan Xin},
title = {PPO Agent Trained on Unity Pyramids Environment},
year = {2025},
howpublished = {\url{https://huggingface.co/jetfan-xin/ppo-Pyramids}},
}
```