---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---

# **ppo** Agent playing **Pyramids**
This is a trained model of a **ppo** agent playing **Pyramids**
using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

## Usage (with ML-Agents)
The Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/

We wrote a complete tutorial to learn how to train your first agent using ML-Agents and publish it to the Hub:
- A *short tutorial* where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your
browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
- A *longer tutorial* to understand how ML-Agents works:
https://huggingface.co/learn/deep-rl-course/unit5/introduction

### Resume the training
```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```

### Watch your Agent play
You can watch your agent **playing directly in your browser**:

1. If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
2. Find your model_id: jetfan-xin/ppo-Pyramids
3. Select your *.nn / *.onnx file
4. Click on Watch the agent play 👀

# 🧠 PPO Agent Trained on Unity Pyramids Environment

This repository contains a reinforcement learning agent trained using **Proximal Policy Optimization (PPO)** on Unity's **Pyramids** environment via **ML-Agents**.

## 📋 Model Overview

- **Algorithm**: PPO with RND (Random Network Distillation)
- **Environment**: Unity Pyramids (3D sparse-reward maze)
- **Framework**: ML-Agents v1.2.0.dev0
- **Backend**: PyTorch 2.7.1 (CUDA-enabled)

The agent learns to navigate a 3D maze and reach the goal area by combining extrinsic and intrinsic rewards.
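
For intuition, here is a minimal, hypothetical sketch of the RND idea (not the actual ML-Agents implementation): a frozen, randomly initialized target network embeds each observation, a predictor network is trained to match that embedding, and the prediction error, which is large for novel states, is scaled by the signal strength and added to the extrinsic reward.

```python
import torch
import torch.nn as nn

# Illustrative RND sketch only; network sizes and obs_dim are
# placeholders, not the real Pyramids dimensions.
def make_mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, embed_dim = 172, 64
target = make_mlp(obs_dim, 64, embed_dim)      # fixed random network
predictor = make_mlp(obs_dim, 64, embed_dim)   # trained to imitate target
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
strength = 0.01  # mirrors the `rnd` strength in this repo's config

def intrinsic_reward(obs):
    # Per-observation prediction error: high for states the predictor
    # has rarely seen, i.e. a novelty (curiosity) signal.
    return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

obs = torch.randn(8, obs_dim)       # dummy batch of observations
extrinsic = torch.randn(8)          # dummy extrinsic rewards
with torch.no_grad():
    total_reward = extrinsic + strength * intrinsic_reward(obs)

# The predictor is trained to minimize the same error, so familiar
# states gradually stop producing intrinsic reward.
loss = intrinsic_reward(obs).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```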

---

## 🚀 How to Use This Model

You can use the `.onnx` model directly in Unity.

### ✅ Steps:

1. **Download the model**

   Clone the repository or download `Pyramids.onnx`:

   ```bash
   git lfs install
   git clone https://huggingface.co/jetfan-xin/ppo-Pyramids
   ```

2. **Place in Unity project**

   Put the model file in your Unity project under:

   ```
   Assets/ML-Agents/Examples/Pyramids/Pyramids.onnx
   ```

3. **Assign in Unity Editor**

   - Select your agent GameObject.
   - In `Behavior Parameters`, assign `Pyramids.onnx` as the model.
   - Make sure the Behavior Name matches your training config.
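
Outside Unity, you can also sanity-check the exported network with `onnxruntime`. The sketch below is hypothetical: input and output tensor names and shapes are model-specific, so it lists them first and then builds a zero-filled dummy feed (mask-like inputs may need ones rather than zeros for meaningful outputs).

```python
import numpy as np
import onnxruntime as ort

# Hypothetical sanity check: load the exported model and run one
# forward pass on dummy data. Tensor names/shapes vary per model.
session = ort.InferenceSession("Pyramids.onnx")

for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)

# Zero-filled feed, using batch size 1 for any symbolic dimensions.
feed = {
    inp.name: np.zeros(
        [d if isinstance(d, int) else 1 for d in inp.shape],
        dtype=np.float32,
    )
    for inp in session.get_inputs()
}
outputs = session.run(None, feed)
print("ran OK:", len(outputs), "output tensors")
```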

---

## ⚙️ Training Configuration

Key settings from `configuration.yaml`:

- `trainer_type`: `ppo`
- `max_steps`: `1000000`
- `batch_size`: `128`, `buffer_size`: `2048`
- `learning_rate`: `3e-4`
- `reward_signals`:
  - `extrinsic`: γ=0.99, strength=1.0
  - `rnd`: γ=0.99, strength=0.01
- `hidden_units`: `512`, `num_layers`: `2`
- `summary_freq`: `30000`

See `configuration.yaml` for full details.
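
To inspect the configuration programmatically, one option is the sketch below, assuming PyYAML is installed and the file follows the standard ML-Agents layout with a top-level `behaviors` key:

```python
import yaml  # pip install pyyaml

# Print the reward-signal settings (extrinsic + RND) for the Pyramids
# behavior; the key layout assumed here is the standard ML-Agents schema.
with open("configuration.yaml") as f:
    config = yaml.safe_load(f)

behavior = config["behaviors"]["Pyramids"]
for name, signal in behavior["reward_signals"].items():
    print(name, "->", signal)
```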

---

## 📈 Training Performance

Sample rewards from the training log:

| Step    | Mean Reward |
|---------|-------------|
| 300,000 | -0.22       |
| 480,000 | 0.35        |
| 660,000 | 1.14        |
| 840,000 | 1.47        |
| 990,000 | 1.54        |

Full console output:
```
(rl_py310) 4xin@ltgpu3:~/deep_rl/unit5/ml-agents$ CUDA_VISIBLE_DEVICES=3 mlagents-learn ./config/ppo/PyramidsRND.yaml \
    --env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
    --run-id="PyramidsGPUTest" \
    --no-graphics

Version information:
  ml-agents: 1.2.0.dev0,
  ml-agents-envs: 1.2.0.dev0,
  Communicator API: 1.5.0,
  PyTorch: 2.7.1+cu126
[INFO] Connected to Unity environment with package version 2.2.1-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Pyramids?team=0
[INFO] Hyperparameters for behavior name Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      shared_critic: False
      learning_rate_schedule: linear
      beta_schedule: linear
      epsilon_schedule: linear
    checkpoint_interval: 500000
    network_settings:
      normalize: False
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
        network_settings:
          normalize: False
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
          memory: None
          goal_conditioning_type: hyper
          deterministic: False
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          normalize: False
          hidden_units: 64
          num_layers: 3
          vis_encode_type: simple
          memory: None
          goal_conditioning_type: hyper
          deterministic: False
        learning_rate: 0.0001
        encoding_size: None
    init_path: None
    keep_checkpoints: 5
    even_checkpoints: False
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 30000
    threaded: False
    self_play: None
    behavioral_cloning: None
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 60000. Time Elapsed: 90.519 s. Mean Reward: -0.853. Std of Reward: 0.588. Training.
[INFO] Pyramids. Step: 90000. Time Elapsed: 136.319 s. Mean Reward: -0.797. Std of Reward: 0.646. Training.
[INFO] Pyramids. Step: 120000. Time Elapsed: 182.893 s. Mean Reward: -0.831. Std of Reward: 0.654. Training.
[INFO] Pyramids. Step: 150000. Time Elapsed: 227.995 s. Mean Reward: -0.715. Std of Reward: 0.760. Training.
[INFO] Pyramids. Step: 180000. Time Elapsed: 270.527 s. Mean Reward: -0.731. Std of Reward: 0.712. Training.
[INFO] Pyramids. Step: 210000. Time Elapsed: 316.617 s. Mean Reward: -0.699. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 240000. Time Elapsed: 361.434 s. Mean Reward: -0.640. Std of Reward: 0.822. Training.
[INFO] Pyramids. Step: 270000. Time Elapsed: 407.787 s. Mean Reward: -0.520. Std of Reward: 0.969. Training.
[INFO] Pyramids. Step: 300000. Time Elapsed: 451.612 s. Mean Reward: -0.222. Std of Reward: 1.135. Training.
[INFO] Pyramids. Step: 330000. Time Elapsed: 496.996 s. Mean Reward: -0.328. Std of Reward: 1.124. Training.
[INFO] Pyramids. Step: 360000. Time Elapsed: 541.248 s. Mean Reward: -0.452. Std of Reward: 0.995. Training.
[INFO] Pyramids. Step: 390000. Time Elapsed: 587.186 s. Mean Reward: -0.411. Std of Reward: 1.044. Training.
[INFO] Pyramids. Step: 420000. Time Elapsed: 630.923 s. Mean Reward: -0.042. Std of Reward: 1.228. Training.
[INFO] Pyramids. Step: 450000. Time Elapsed: 675.866 s. Mean Reward: 0.009. Std of Reward: 1.237. Training.
[INFO] Pyramids. Step: 480000. Time Elapsed: 721.391 s. Mean Reward: 0.351. Std of Reward: 1.271. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-499992.onnx
[INFO] Pyramids. Step: 510000. Time Elapsed: 767.344 s. Mean Reward: 0.647. Std of Reward: 1.140. Training.
[INFO] Pyramids. Step: 540000. Time Elapsed: 812.656 s. Mean Reward: 0.526. Std of Reward: 1.178. Training.
[INFO] Pyramids. Step: 570000. Time Elapsed: 857.156 s. Mean Reward: 0.525. Std of Reward: 1.236. Training.
[INFO] Pyramids. Step: 600000. Time Elapsed: 900.647 s. Mean Reward: 0.979. Std of Reward: 0.977. Training.
[INFO] Pyramids. Step: 630000. Time Elapsed: 949.947 s. Mean Reward: 1.044. Std of Reward: 1.040. Training.
[INFO] Pyramids. Step: 660000. Time Elapsed: 1006.810 s. Mean Reward: 1.143. Std of Reward: 0.937. Training.
[INFO] Pyramids. Step: 690000. Time Elapsed: 1062.833 s. Mean Reward: 1.151. Std of Reward: 0.997. Training.
[INFO] Pyramids. Step: 720000. Time Elapsed: 1119.948 s. Mean Reward: 1.499. Std of Reward: 0.563. Training.
[INFO] Pyramids. Step: 750000. Time Elapsed: 1178.547 s. Mean Reward: 1.308. Std of Reward: 0.835. Training.
[INFO] Pyramids. Step: 780000. Time Elapsed: 1226.204 s. Mean Reward: 1.278. Std of Reward: 0.866. Training.
[INFO] Pyramids. Step: 810000. Time Elapsed: 1275.499 s. Mean Reward: 1.318. Std of Reward: 0.856. Training.
[INFO] Pyramids. Step: 840000. Time Elapsed: 1322.302 s. Mean Reward: 1.477. Std of Reward: 0.641. Training.
[INFO] Pyramids. Step: 870000. Time Elapsed: 1370.429 s. Mean Reward: 1.367. Std of Reward: 0.816. Training.
[INFO] Pyramids. Step: 900000. Time Elapsed: 1418.228 s. Mean Reward: 1.471. Std of Reward: 0.689. Training.
[INFO] Pyramids. Step: 930000. Time Elapsed: 1465.721 s. Mean Reward: 1.514. Std of Reward: 0.619. Training.
[INFO] Pyramids. Step: 960000. Time Elapsed: 1513.116 s. Mean Reward: 1.403. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-999909.onnx
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx
[INFO] Copied results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx to results/PyramidsGPUTest/Pyramids.onnx.
```
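
To visualize the learning curve, a small hypothetical script can parse the `Mean Reward` lines and plot them (assuming the console output above was saved to `training.log`; that file name is an assumption):

```python
import re
import matplotlib.pyplot as plt

# Parse "Step: N. ... Mean Reward: R." lines from a saved copy of the
# console output shown above (the log file name is an assumption).
pattern = re.compile(r"Step: (\d+)\..*?Mean Reward: (-?\d+\.\d+)")

steps, rewards = [], []
with open("training.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            rewards.append(float(m.group(2)))

plt.plot(steps, rewards, marker="o")
plt.xlabel("Environment steps")
plt.ylabel("Mean reward")
plt.title("PPO + RND on Pyramids")
plt.tight_layout()
plt.savefig("learning_curve.png")
```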

✅ Model exported to `Pyramids.onnx` after reaching max steps.

---

## 🖥️ Training Setup

- **Run ID**: `PyramidsGPUTest`
- **GPU**: NVIDIA A100 80GB PCIe
- **Training time**: ~26 minutes
- **ML-Agents Envs**: v1.2.0.dev0
- **Communicator API**: v1.5.0

---

## 📁 Repository Contents

| File / Folder        | Description                          |
|----------------------|--------------------------------------|
| `Pyramids.onnx`      | Exported trained PPO agent           |
| `configuration.yaml` | Full PPO + RND training config       |
| `run_logs/`          | Training logs from ML-Agents         |
| `Pyramids/`          | Environment-specific output folder   |
| `config.json`        | Metadata for Hugging Face model card |

---

## 📚 Citation

If you use this model, please consider citing:

```bibtex
@misc{ppoPyramidsJetfan,
  author       = {Jingfan Xin},
  title        = {PPO Agent Trained on Unity Pyramids Environment},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/jetfan-xin/ppo-Pyramids}},
}
```