---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---

# **ppo** Agent playing **Pyramids**
This is a trained model of a **ppo** agent playing **Pyramids**
using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

## Usage (with ML-Agents)
The Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/

We wrote a complete tutorial to learn how to train your first agent using ML-Agents and publish it to the Hub:
- A *short tutorial* where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your
browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
- A *longer tutorial* to understand how ML-Agents works:
https://huggingface.co/learn/deep-rl-course/unit5/introduction

### Resume the training
```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```

### Watch your Agent play
You can watch your agent **playing directly in your browser**:

1. If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
2. Find your model_id: jetfan-xin/ppo-Pyramids
3. Select your *.nn / *.onnx file
4. Click on Watch the agent play 👀

# 🧠 PPO Agent Trained on Unity Pyramids Environment

This repository contains a reinforcement learning agent trained using **Proximal Policy Optimization (PPO)** on Unity's **Pyramids** environment via **ML-Agents**.

## 📋 Model Overview

- **Algorithm**: PPO with RND (Random Network Distillation)
- **Environment**: Unity Pyramids (3D sparse-reward maze)
- **Framework**: ML-Agents v1.2.0.dev0
- **Backend**: PyTorch 2.7.1 (CUDA-enabled)

The agent learns to navigate a 3D maze and reach the goal area by combining extrinsic and intrinsic rewards.
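
For intuition, here is a minimal, hypothetical sketch of the RND idea (not the actual ML-Agents implementation): a frozen, randomly initialized target network embeds each observation, a predictor network is trained to match that embedding, and the prediction error, which is large for novel states, is scaled by the signal strength and added to the extrinsic reward.

```python
import torch
import torch.nn as nn

# Illustrative RND sketch only; network sizes and obs_dim are
# placeholders, not the real Pyramids dimensions.
def make_mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, embed_dim = 172, 64
target = make_mlp(obs_dim, 64, embed_dim)      # fixed random network
predictor = make_mlp(obs_dim, 64, embed_dim)   # trained to imitate target
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
strength = 0.01  # mirrors the `rnd` strength in this repo's config

def intrinsic_reward(obs):
    # Per-observation prediction error: high for states the predictor
    # has rarely seen, i.e. a novelty (curiosity) signal.
    return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

obs = torch.randn(8, obs_dim)       # dummy batch of observations
extrinsic = torch.randn(8)          # dummy extrinsic rewards
with torch.no_grad():
    total_reward = extrinsic + strength * intrinsic_reward(obs)

# The predictor is trained to minimize the same error, so familiar
# states gradually stop producing intrinsic reward.
loss = intrinsic_reward(obs).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```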

---

## 🚀 How to Use This Model

You can use the `.onnx` model directly in Unity.

### ✅ Steps:

1. **Download the model**

   Clone the repository or download `Pyramids.onnx`:

   ```bash
   git lfs install
   git clone https://huggingface.co/jetfan-xin/ppo-Pyramids
   ```

2. **Place in Unity project**

   Put the model file in your Unity project under:

   ```
   Assets/ML-Agents/Examples/Pyramids/Pyramids.onnx
   ```

3. **Assign in Unity Editor**

   - Select your agent GameObject.
   - In `Behavior Parameters`, assign `Pyramids.onnx` as the model.
   - Make sure the Behavior Name matches your training config.
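
Outside Unity, you can also sanity-check the exported network with `onnxruntime`. The sketch below is hypothetical: input and output tensor names and shapes are model-specific, so it lists them first and then builds a zero-filled dummy feed (mask-like inputs may need ones rather than zeros for meaningful outputs).

```python
import numpy as np
import onnxruntime as ort

# Hypothetical sanity check: load the exported model and run one
# forward pass on dummy data. Tensor names/shapes vary per model.
session = ort.InferenceSession("Pyramids.onnx")

for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)

# Zero-filled feed, using batch size 1 for any symbolic dimensions.
feed = {
    inp.name: np.zeros(
        [d if isinstance(d, int) else 1 for d in inp.shape],
        dtype=np.float32,
    )
    for inp in session.get_inputs()
}
outputs = session.run(None, feed)
print("ran OK:", len(outputs), "output tensors")
```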

---

## ⚙️ Training Configuration

Key settings from `configuration.yaml`:

- `trainer_type`: `ppo`
- `max_steps`: `1000000`
- `batch_size`: `128`, `buffer_size`: `2048`
- `learning_rate`: `3e-4`
- `reward_signals`:
  - `extrinsic`: γ=0.99, strength=1.0
  - `rnd`: γ=0.99, strength=0.01
- `hidden_units`: `512`, `num_layers`: `2`
- `summary_freq`: `30000`

See `configuration.yaml` for full details.
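
To inspect the configuration programmatically, one option is the sketch below, assuming PyYAML is installed and the file follows the standard ML-Agents layout with a top-level `behaviors` key:

```python
import yaml  # pip install pyyaml

# Print the reward-signal settings (extrinsic + RND) for the Pyramids
# behavior; the key layout assumed here is the standard ML-Agents schema.
with open("configuration.yaml") as f:
    config = yaml.safe_load(f)

behavior = config["behaviors"]["Pyramids"]
for name, signal in behavior["reward_signals"].items():
    print(name, "->", signal)
```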

---

## 📈 Training Performance

Sample rewards from the training log:

| Step    | Mean Reward |
|---------|-------------|
| 300,000 | -0.22       |
| 480,000 | 0.35        |
| 660,000 | 1.14        |
| 840,000 | 1.47        |
| 990,000 | 1.54        |

Full console output:
```
(rl_py310) 4xin@ltgpu3:~/deep_rl/unit5/ml-agents$ CUDA_VISIBLE_DEVICES=3 mlagents-learn ./config/ppo/PyramidsRND.yaml \
    --env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
    --run-id="PyramidsGPUTest" \
    --no-graphics

Version information:
  ml-agents: 1.2.0.dev0,
  ml-agents-envs: 1.2.0.dev0,
  Communicator API: 1.5.0,
  PyTorch: 2.7.1+cu126
[INFO] Connected to Unity environment with package version 2.2.1-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Pyramids?team=0
[INFO] Hyperparameters for behavior name Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      shared_critic: False
      learning_rate_schedule: linear
      beta_schedule: linear
      epsilon_schedule: linear
    checkpoint_interval: 500000
    network_settings:
      normalize: False
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
        network_settings:
          normalize: False
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
          memory: None
          goal_conditioning_type: hyper
          deterministic: False
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          normalize: False
          hidden_units: 64
          num_layers: 3
          vis_encode_type: simple
          memory: None
          goal_conditioning_type: hyper
          deterministic: False
        learning_rate: 0.0001
        encoding_size: None
    init_path: None
    keep_checkpoints: 5
    even_checkpoints: False
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 30000
    threaded: False
    self_play: None
    behavioral_cloning: None
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 60000. Time Elapsed: 90.519 s. Mean Reward: -0.853. Std of Reward: 0.588. Training.
[INFO] Pyramids. Step: 90000. Time Elapsed: 136.319 s. Mean Reward: -0.797. Std of Reward: 0.646. Training.
[INFO] Pyramids. Step: 120000. Time Elapsed: 182.893 s. Mean Reward: -0.831. Std of Reward: 0.654. Training.
[INFO] Pyramids. Step: 150000. Time Elapsed: 227.995 s. Mean Reward: -0.715. Std of Reward: 0.760. Training.
[INFO] Pyramids. Step: 180000. Time Elapsed: 270.527 s. Mean Reward: -0.731. Std of Reward: 0.712. Training.
[INFO] Pyramids. Step: 210000. Time Elapsed: 316.617 s. Mean Reward: -0.699. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 240000. Time Elapsed: 361.434 s. Mean Reward: -0.640. Std of Reward: 0.822. Training.
[INFO] Pyramids. Step: 270000. Time Elapsed: 407.787 s. Mean Reward: -0.520. Std of Reward: 0.969. Training.
[INFO] Pyramids. Step: 300000. Time Elapsed: 451.612 s. Mean Reward: -0.222. Std of Reward: 1.135. Training.
[INFO] Pyramids. Step: 330000. Time Elapsed: 496.996 s. Mean Reward: -0.328. Std of Reward: 1.124. Training.
[INFO] Pyramids. Step: 360000. Time Elapsed: 541.248 s. Mean Reward: -0.452. Std of Reward: 0.995. Training.
[INFO] Pyramids. Step: 390000. Time Elapsed: 587.186 s. Mean Reward: -0.411. Std of Reward: 1.044. Training.
[INFO] Pyramids. Step: 420000. Time Elapsed: 630.923 s. Mean Reward: -0.042. Std of Reward: 1.228. Training.
[INFO] Pyramids. Step: 450000. Time Elapsed: 675.866 s. Mean Reward: 0.009. Std of Reward: 1.237. Training.
[INFO] Pyramids. Step: 480000. Time Elapsed: 721.391 s. Mean Reward: 0.351. Std of Reward: 1.271. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-499992.onnx
[INFO] Pyramids. Step: 510000. Time Elapsed: 767.344 s. Mean Reward: 0.647. Std of Reward: 1.140. Training.
[INFO] Pyramids. Step: 540000. Time Elapsed: 812.656 s. Mean Reward: 0.526. Std of Reward: 1.178. Training.
[INFO] Pyramids. Step: 570000. Time Elapsed: 857.156 s. Mean Reward: 0.525. Std of Reward: 1.236. Training.
[INFO] Pyramids. Step: 600000. Time Elapsed: 900.647 s. Mean Reward: 0.979. Std of Reward: 0.977. Training.
[INFO] Pyramids. Step: 630000. Time Elapsed: 949.947 s. Mean Reward: 1.044. Std of Reward: 1.040. Training.
[INFO] Pyramids. Step: 660000. Time Elapsed: 1006.810 s. Mean Reward: 1.143. Std of Reward: 0.937. Training.
[INFO] Pyramids. Step: 690000. Time Elapsed: 1062.833 s. Mean Reward: 1.151. Std of Reward: 0.997. Training.
[INFO] Pyramids. Step: 720000. Time Elapsed: 1119.948 s. Mean Reward: 1.499. Std of Reward: 0.563. Training.
[INFO] Pyramids. Step: 750000. Time Elapsed: 1178.547 s. Mean Reward: 1.308. Std of Reward: 0.835. Training.
[INFO] Pyramids. Step: 780000. Time Elapsed: 1226.204 s. Mean Reward: 1.278. Std of Reward: 0.866. Training.
[INFO] Pyramids. Step: 810000. Time Elapsed: 1275.499 s. Mean Reward: 1.318. Std of Reward: 0.856. Training.
[INFO] Pyramids. Step: 840000. Time Elapsed: 1322.302 s. Mean Reward: 1.477. Std of Reward: 0.641. Training.
[INFO] Pyramids. Step: 870000. Time Elapsed: 1370.429 s. Mean Reward: 1.367. Std of Reward: 0.816. Training.
[INFO] Pyramids. Step: 900000. Time Elapsed: 1418.228 s. Mean Reward: 1.471. Std of Reward: 0.689. Training.
[INFO] Pyramids. Step: 930000. Time Elapsed: 1465.721 s. Mean Reward: 1.514. Std of Reward: 0.619. Training.
[INFO] Pyramids. Step: 960000. Time Elapsed: 1513.116 s. Mean Reward: 1.403. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-999909.onnx
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx
[INFO] Copied results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx to results/PyramidsGPUTest/Pyramids.onnx.
```
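
To visualize the learning curve, a small hypothetical script can parse the `Mean Reward` lines and plot them (assuming the console output above was saved to `training.log`; that file name is an assumption):

```python
import re
import matplotlib.pyplot as plt

# Parse "Step: N. ... Mean Reward: R." lines from a saved copy of the
# console output shown above (the log file name is an assumption).
pattern = re.compile(r"Step: (\d+)\..*?Mean Reward: (-?\d+\.\d+)")

steps, rewards = [], []
with open("training.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            rewards.append(float(m.group(2)))

plt.plot(steps, rewards, marker="o")
plt.xlabel("Environment steps")
plt.ylabel("Mean reward")
plt.title("PPO + RND on Pyramids")
plt.tight_layout()
plt.savefig("learning_curve.png")
```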

✅ Model exported to `Pyramids.onnx` after reaching max steps.

---

## 🖥️ Training Setup

- **Run ID**: `PyramidsGPUTest`
- **GPU**: NVIDIA A100 80GB PCIe
- **Training time**: ~26 minutes
- **ML-Agents Envs**: v1.2.0.dev0
- **Communicator API**: v1.5.0

---

## 📁 Repository Contents

| File / Folder        | Description                          |
|----------------------|--------------------------------------|
| `Pyramids.onnx`      | Exported trained PPO agent           |
| `configuration.yaml` | Full PPO + RND training config       |
| `run_logs/`          | Training logs from ML-Agents         |
| `Pyramids/`          | Environment-specific output folder   |
| `config.json`        | Metadata for Hugging Face model card |

---

## 📚 Citation

If you use this model, please consider citing:

```bibtex
@misc{ppoPyramidsJetfan,
  author       = {Jingfan Xin},
  title        = {PPO Agent Trained on Unity Pyramids Environment},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/jetfan-xin/ppo-Pyramids}},
}
```