---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---

  # **ppo** Agent playing **Pyramids**
  This is a trained model of a **ppo** agent playing **Pyramids**
  using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

  ## Usage (with ML-Agents)
  The Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/

  We wrote a complete tutorial to learn to train your first agent using ML-Agents and publish it to the Hub:
  - A *short tutorial* where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your
  browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
  - A *longer tutorial* to understand how ML-Agents works:
  https://huggingface.co/learn/deep-rl-course/unit5/introduction

  ### Resume the training
  ```bash
  mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
  ```

  ### Watch your Agent play
  You can watch your agent **playing directly in your browser**

  1. If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
  2. Find your model_id: jetfan-xin/ppo-Pyramids
  3. Select your *.nn or *.onnx file
  4. Click on Watch the agent play 👀


# 🧠 PPO Agent Trained on Unity Pyramids Environment

This repository contains a reinforcement learning agent trained using **Proximal Policy Optimization (PPO)** on Unity's **Pyramids** environment via **ML-Agents**.

## 📌 Model Overview

- **Algorithm**: PPO with RND (Random Network Distillation)
- **Environment**: Unity Pyramids (3D sparse-reward maze)
- **Framework**: ML-Agents v1.2.0.dev0
- **Backend**: PyTorch 2.7.1 (CUDA-enabled)

The agent learns to navigate a 3D maze and reach the goal area by combining extrinsic and intrinsic rewards.
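To make the reward combination concrete, here is a minimal NumPy sketch of Random Network Distillation: a fixed, randomly initialized target network and a predictor network that would be trained to imitate it, so prediction error is high on novel observations. This is an illustrative toy, not the ML-Agents implementation; `make_mlp`, the network sizes, and `intrinsic_reward` are hypothetical names, while the `1.0` / `0.01` strengths mirror this run's `configuration.yaml`.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes, rng):
    """Build a random-weight MLP and return its forward function (weights stay fixed)."""
    weights = [rng.normal(0, 1 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for w in weights[:-1]:
            x = np.tanh(x @ w)
        return x @ weights[-1]
    return forward

obs_dim, embed_dim = 16, 8
target = make_mlp([obs_dim, 64, embed_dim], rng)     # fixed target net, never trained
predictor = make_mlp([obs_dim, 64, embed_dim], rng)  # in real RND, trained to match target

def intrinsic_reward(obs):
    # RND novelty signal: predictor error is large on rarely seen observations
    return float(np.mean((predictor(obs) - target(obs)) ** 2))

def total_reward(extrinsic, obs, ext_strength=1.0, rnd_strength=0.01):
    # Weighted sum, with strengths matching the reward_signals in configuration.yaml
    return ext_strength * extrinsic + rnd_strength * intrinsic_reward(obs)

obs = rng.normal(size=obs_dim)
print(total_reward(-1.0, obs))  # extrinsic -1 plus a small positive novelty bonus
```

The small `0.01` RND strength keeps the exploration bonus from swamping the sparse extrinsic reward once the agent starts reaching the goal.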

---

## 🚀 How to Use This Model

You can use the `.onnx` model directly in Unity.

### ✅ Steps:

1. **Download the model**

   Clone the repository or download `Pyramids.onnx`:

   ```bash
   git lfs install
   git clone https://huggingface.co/jetfan-xin/ppo-Pyramids
   ```

2. **Place in Unity project**

   Put the model file in your Unity project under:

   ```
   Assets/ML-Agents/Examples/Pyramids/Pyramids.onnx
   ```

3. **Assign in Unity Editor**

   - Select your agent GameObject.
   - In `Behavior Parameters`, assign `Pyramids.onnx` as the model.
   - Make sure the Behavior Name matches your training config.

---

## βš™οΈ Training Configuration

Key settings from `configuration.yaml`:

- `trainer_type`: `ppo`  
- `max_steps`: `1000000`  
- `batch_size`: `128`, `buffer_size`: `2048`  
- `learning_rate`: `3e-4`  
- `reward_signals`:  
  - `extrinsic`: γ=0.99, strength=1.0  
  - `rnd`: γ=0.99, strength=0.01  
- `hidden_units`: `512`, `num_layers`: `2`  
- `summary_freq`: `30000`

See `configuration.yaml` for full details.
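The hyperparameters above use `linear` schedules (`learning_rate_schedule`, `beta_schedule`, `epsilon_schedule`), meaning each value is annealed from its initial setting toward zero over `max_steps`. The sketch below shows how such a schedule might be computed; the `floor` value is an illustrative lower bound, and ML-Agents' exact minimum may differ.

```python
def linear_schedule(initial: float, step: int,
                    max_steps: int = 1_000_000, floor: float = 1e-10) -> float:
    """Linearly anneal a hyperparameter from `initial` toward ~0 over training.

    `floor` is an assumed lower bound for illustration, not ML-Agents' exact value.
    """
    frac = max(0.0, 1.0 - step / max_steps)
    return max(floor, initial * frac)

# The three linearly scheduled values from configuration.yaml, halfway through training
for name, init in [("learning_rate", 3e-4), ("beta", 0.01), ("epsilon", 0.2)]:
    print(name, linear_schedule(init, step=500_000))
```

Annealing `epsilon` (the PPO clip range) and `beta` (entropy bonus) alongside the learning rate gradually shifts the agent from exploration toward exploitation as training progresses.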

---

## 📈 Training Performance

Sample rewards from training log:

| Step      | Mean Reward |
|-----------|-------------|
| 300,000   | -0.22       |
| 480,000   |  0.35       |
| 660,000   |  1.14       |
| 840,000   |  1.47       |
| 990,000   |  1.54       |

Full console log from the run:
```
(rl_py310) 4xin@ltgpu3:~/deep_rl/unit5/ml-agents$ CUDA_VISIBLE_DEVICES=3 mlagents-learn ./config/ppo/PyramidsRND.yaml \
  --env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
  --run-id="PyramidsGPUTest" \
  --no-graphics

 Version information:
  ml-agents: 1.2.0.dev0,
  ml-agents-envs: 1.2.0.dev0,
  Communicator API: 1.5.0,
  PyTorch: 2.7.1+cu126
[INFO] Connected to Unity environment with package version 2.2.1-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Pyramids?team=0
[INFO] Hyperparameters for behavior name Pyramids: 
        trainer_type:   ppo
        hyperparameters:        
          batch_size:   128
          buffer_size:  2048
          learning_rate:        0.0003
          beta: 0.01
          epsilon:      0.2
          lambd:        0.95
          num_epoch:    3
          shared_critic:        False
          learning_rate_schedule:       linear
          beta_schedule:        linear
          epsilon_schedule:     linear
        checkpoint_interval:    500000
        network_settings:       
          normalize:    False
          hidden_units: 512
          num_layers:   2
          vis_encode_type:      simple
          memory:       None
          goal_conditioning_type:       hyper
          deterministic:        False
        reward_signals: 
          extrinsic:    
            gamma:      0.99
            strength:   1.0
            network_settings:   
              normalize:        False
              hidden_units:     128
              num_layers:       2
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
          rnd:  
            gamma:      0.99
            strength:   0.01
            network_settings:   
              normalize:        False
              hidden_units:     64
              num_layers:       3
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
            learning_rate:      0.0001
            encoding_size:      None
        init_path:      None
        keep_checkpoints:       5
        even_checkpoints:       False
        max_steps:      1000000
        time_horizon:   128
        summary_freq:   30000
        threaded:       False
        self_play:      None
        behavioral_cloning:     None
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 60000. Time Elapsed: 90.519 s. Mean Reward: -0.853. Std of Reward: 0.588. Training.
[INFO] Pyramids. Step: 90000. Time Elapsed: 136.319 s. Mean Reward: -0.797. Std of Reward: 0.646. Training.
[INFO] Pyramids. Step: 120000. Time Elapsed: 182.893 s. Mean Reward: -0.831. Std of Reward: 0.654. Training.
[INFO] Pyramids. Step: 150000. Time Elapsed: 227.995 s. Mean Reward: -0.715. Std of Reward: 0.760. Training.
[INFO] Pyramids. Step: 180000. Time Elapsed: 270.527 s. Mean Reward: -0.731. Std of Reward: 0.712. Training.
[INFO] Pyramids. Step: 210000. Time Elapsed: 316.617 s. Mean Reward: -0.699. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 240000. Time Elapsed: 361.434 s. Mean Reward: -0.640. Std of Reward: 0.822. Training.
[INFO] Pyramids. Step: 270000. Time Elapsed: 407.787 s. Mean Reward: -0.520. Std of Reward: 0.969. Training.
[INFO] Pyramids. Step: 300000. Time Elapsed: 451.612 s. Mean Reward: -0.222. Std of Reward: 1.135. Training.
[INFO] Pyramids. Step: 330000. Time Elapsed: 496.996 s. Mean Reward: -0.328. Std of Reward: 1.124. Training.
[INFO] Pyramids. Step: 360000. Time Elapsed: 541.248 s. Mean Reward: -0.452. Std of Reward: 0.995. Training.
[INFO] Pyramids. Step: 390000. Time Elapsed: 587.186 s. Mean Reward: -0.411. Std of Reward: 1.044. Training.
[INFO] Pyramids. Step: 420000. Time Elapsed: 630.923 s. Mean Reward: -0.042. Std of Reward: 1.228. Training.
[INFO] Pyramids. Step: 450000. Time Elapsed: 675.866 s. Mean Reward: 0.009. Std of Reward: 1.237. Training.
[INFO] Pyramids. Step: 480000. Time Elapsed: 721.391 s. Mean Reward: 0.351. Std of Reward: 1.271. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-499992.onnx
[INFO] Pyramids. Step: 510000. Time Elapsed: 767.344 s. Mean Reward: 0.647. Std of Reward: 1.140. Training.
[INFO] Pyramids. Step: 540000. Time Elapsed: 812.656 s. Mean Reward: 0.526. Std of Reward: 1.178. Training.
[INFO] Pyramids. Step: 570000. Time Elapsed: 857.156 s. Mean Reward: 0.525. Std of Reward: 1.236. Training.
[INFO] Pyramids. Step: 600000. Time Elapsed: 900.647 s. Mean Reward: 0.979. Std of Reward: 0.977. Training.
[INFO] Pyramids. Step: 630000. Time Elapsed: 949.947 s. Mean Reward: 1.044. Std of Reward: 1.040. Training.
[INFO] Pyramids. Step: 660000. Time Elapsed: 1006.810 s. Mean Reward: 1.143. Std of Reward: 0.937. Training.
[INFO] Pyramids. Step: 690000. Time Elapsed: 1062.833 s. Mean Reward: 1.151. Std of Reward: 0.997. Training.
[INFO] Pyramids. Step: 720000. Time Elapsed: 1119.948 s. Mean Reward: 1.499. Std of Reward: 0.563. Training.
[INFO] Pyramids. Step: 750000. Time Elapsed: 1178.547 s. Mean Reward: 1.308. Std of Reward: 0.835. Training.
[INFO] Pyramids. Step: 780000. Time Elapsed: 1226.204 s. Mean Reward: 1.278. Std of Reward: 0.866. Training.
[INFO] Pyramids. Step: 810000. Time Elapsed: 1275.499 s. Mean Reward: 1.318. Std of Reward: 0.856. Training.
[INFO] Pyramids. Step: 840000. Time Elapsed: 1322.302 s. Mean Reward: 1.477. Std of Reward: 0.641. Training.
[INFO] Pyramids. Step: 870000. Time Elapsed: 1370.429 s. Mean Reward: 1.367. Std of Reward: 0.816. Training.
[INFO] Pyramids. Step: 900000. Time Elapsed: 1418.228 s. Mean Reward: 1.471. Std of Reward: 0.689. Training.
[INFO] Pyramids. Step: 930000. Time Elapsed: 1465.721 s. Mean Reward: 1.514. Std of Reward: 0.619. Training.
[INFO] Pyramids. Step: 960000. Time Elapsed: 1513.116 s. Mean Reward: 1.403. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-999909.onnx
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx
[INFO] Copied results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx to results/PyramidsGPUTest/Pyramids.onnx.
```

✅ Model exported to `Pyramids.onnx` after reaching max steps.
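Summary tables like the one above can be extracted from the console output with a small regex. This is a hypothetical helper, not part of ML-Agents; it assumes the `mlagents-learn` log line format shown in this run.

```python
import re

LOG_LINE = re.compile(
    r"Step: (\d+)\. Time Elapsed: ([\d.]+) s\. "
    r"Mean Reward: (-?[\d.]+)\. Std of Reward: ([\d.]+)\."
)

def parse_training_log(text: str):
    """Extract (step, mean_reward) pairs from mlagents-learn console output."""
    return [(int(m[0]), float(m[2])) for m in LOG_LINE.findall(text)]

sample = """\
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
"""
print(parse_training_log(sample))  # [(30000, -1.0), (990000, 1.544)]
```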

---

## 🖥️ Training Setup

- **Run ID**: `PyramidsGPUTest`  
- **GPU**: NVIDIA A100 80GB PCIe  
- **Training time**: ~26 minutes  
- **ML-Agents Envs**: v1.2.0.dev0  
- **Communicator API**: v1.5.0  

---

## πŸ“ Repository Contents

| File / Folder         | Description                                  |
|------------------------|----------------------------------------------|
| `Pyramids.onnx`        | Exported trained PPO agent                  |
| `configuration.yaml`   | Full PPO + RND training config              |
| `run_logs/`            | Training logs from ML-Agents                |
| `Pyramids/`            | Environment-specific output folder          |
| `config.json`          | Metadata for Hugging Face model card        |

---

## 📚 Citation

If you use this model, please consider citing:

```
@misc{ppoPyramidsJetfan,
  author = {Jingfan Xin},
  title = {PPO Agent Trained on Unity Pyramids Environment},
  year = {2025},
  howpublished = {\url{https://huggingface.co/jetfan-xin/ppo-Pyramids}},
}
```