---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---
# PPO-Pyramids Unity ML-Agents Model
## Model Description
This model is a Proximal Policy Optimization (PPO) agent trained to solve the Pyramids environment from Unity ML-Agents. Pyramids is a sparse-reward 3D navigation and puzzle-solving task: the agent must press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that falls from its top, all while navigating obstacles.
## Model Details
### Model Architecture
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Framework**: Unity ML-Agents with PyTorch backend
- **Policy Type**: Actor-Critic with shared feature extraction
- **Network Architecture**:
- Hidden Units: 512 per layer
- Number of Layers: 2
- Activation: ReLU (default)
- Normalization: Disabled
- Visual Encoding: Simple CNN for visual observations
### Environment: Pyramids
The Pyramids environment is one of Unity ML-Agents' example environments featuring:
- **Objective**: Find and press a button to spawn a pyramid at a random location, then topple it and reach the gold brick
- **Setting**: 3D pyramid-like structures with multiple levels and obstacles
- **Complexity**: Multi-agent environment with navigation and spatial reasoning challenges
- **Visual Component**: First-person or third-person visual observations
## Training Configuration
### PPO Hyperparameters
```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01 # Entropy regularization
epsilon: 0.2 # PPO clipping parameter
lambd: 0.95  # GAE parameter (ML-Agents spells this key "lambd")
num_epoch: 3 # Training epochs per update
learning_rate_schedule: linear
```
### Network Settings
```yaml
normalize: false # Input normalization
hidden_units: 512 # Units per hidden layer
num_layers: 2 # Number of hidden layers
vis_encode_type: simple # Visual encoder type
```
### Reward Structure
- **Extrinsic Rewards**:
- Gamma: 0.99 (discount factor)
- Strength: 1.0
- Sparse rewards for reaching goals
- Time penalties for efficiency
- **Intrinsic Rewards (RND)**:
- Random Network Distillation for exploration
- Gamma: 0.99
- Strength: 0.01
- Separate network: 64 units, 3 layers
- Learning rate: 0.0001
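In ML-Agents' trainer-config format, these two signals sit under `reward_signals` in the behavior block; a sketch matching the values above:
```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  rnd:  # Random Network Distillation exploration bonus
    gamma: 0.99
    strength: 0.01
    network_settings:
      hidden_units: 64
      num_layers: 3
    learning_rate: 0.0001
```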
### Training Process
- **Max Steps**: 1,000,000 training steps
- **Time Horizon**: 128 steps per trajectory
- **Checkpoints**: Keep the 5 most recent checkpoints (`keep_checkpoints: 5`)
- **Summary Frequency**: Every 30,000 steps
- **Training Time**: Approximately 4-8 hours on a modern GPU
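These trainer-level settings appear alongside the hyperparameters, network settings, and reward signals in the same behavior block of the configuration file:
```yaml
keep_checkpoints: 5
max_steps: 1000000
time_horizon: 128
summary_freq: 30000
```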
## Observation Space
The agent receives:
- **Visual Observations**: RGB camera input (84x84x3 typically)
- **Vector Observations**: Agent position, rotation, velocity
- **Goal Information**: Relative goal position and distance
- **Environmental Context**: Obstacle proximity, platform information
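A minimal sketch (assuming a local Pyramids build and `mlagents-envs` >= 0.28) of inspecting what the agent actually observes through the low-level Python API:
```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="Pyramids")
env.reset()

# Each agent type ("behavior") in the scene exposes its own spec
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]
print("Observation shapes:", [o.shape for o in spec.observation_specs])

# decision_steps.obs is a list of numpy arrays, one per observation,
# each with a leading batch dimension over the agents in the scene
decision_steps, terminal_steps = env.get_steps(behavior_name)
for i, obs in enumerate(decision_steps.obs):
    print(f"obs[{i}]: shape={obs.shape}")
```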
## Action Space
- **Action Type**: Continuous
- **Action Dimensions**: 3-4 continuous values
- Forward/backward movement
- Left/right movement
- Rotation (yaw)
- Optional: Jump action
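Whatever the exact action layout, `ActionSpec.random_action` adapts to it, which makes a quick smoke test easy. Continuing from the snippet above (random actions only, not the trained policy):
```python
# Sample one random action per agent awaiting a decision, then advance
decision_steps, terminal_steps = env.get_steps(behavior_name)
action = spec.action_spec.random_action(len(decision_steps))
env.set_actions(behavior_name, action)
env.step()
env.close()
```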
## Performance Metrics
### Expected Performance
- **Goal Reaching Success Rate**: 80-95%
- **Average Episode Length**: Decreases over training as the agent learns shorter paths to the goal
- **Training Convergence**: Stable improvement over 1M steps
- **Exploration Efficiency**: Balanced exploration vs exploitation
### Key Metrics Tracked
- **Cumulative Reward**: Total reward per episode
- **Success Rate**: Percentage of episodes reaching goal
- **Episode Length**: Steps to complete episode
- **Policy Entropy**: Measure of action diversity
- **Value Function Accuracy**: Critic network performance
## Technical Implementation
### PPO Algorithm Features
- **Policy Clipping**: Prevents destructive policy updates (ε = 0.2)
- **Generalized Advantage Estimation**: GAE with λ = 0.95
- **Entropy Regularization**: Encourages exploration (β = 0.01)
- **Value Function Learning**: Shared network with policy
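For intuition, the clipped surrogate objective fits in a few lines of PyTorch. This is an illustrative sketch of the math, not ML-Agents' internal implementation:
```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped vs. clipped surrogate; taking the min bounds the update
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # PPO maximizes the surrogate, so return its negation as a loss
    return -torch.min(unclipped, clipped).mean()
```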
### Random Network Distillation (RND)
- **Purpose**: Intrinsic motivation for exploration
- **Implementation**: Separate predictor and target networks
- **Benefit**: Encourages visiting novel states
- **Balance**: Low strength (0.01) to avoid overwhelming extrinsic rewards
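The mechanism is compact enough to sketch: a frozen, randomly initialized target network plus a trained predictor, with the predictor's error serving as the exploration bonus. A toy version (the observation size is hypothetical):
```python
import torch
import torch.nn as nn

def make_net(obs_dim, hidden=64, layers=3, out_dim=64):
    mods, d = [], obs_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    return nn.Sequential(*mods, nn.Linear(d, out_dim))

obs_dim = 172                  # hypothetical observation size
target = make_net(obs_dim)     # fixed random network, never trained
predictor = make_net(obs_dim)  # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

def intrinsic_reward(obs):
    # Prediction error is large in rarely visited states, so it doubles
    # as both the exploration bonus and the predictor's training loss
    with torch.no_grad():
        target_out = target(obs)
    return (predictor(obs) - target_out).pow(2).mean(dim=-1)
```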
### Unity ML-Agents Integration
- **Training Interface**: Python mlagents-learn command
- **Environment Communication**: Unity-Python API
- **Parallel Training**: Multiple environment instances
- **Real-time Monitoring**: TensorBoard integration
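Parallel data collection, for example, is requested at launch time: the `--num-envs` flag of `mlagents-learn` spawns several copies of the environment executable (the run ID below is illustrative):
```bash
mlagents-learn config.yaml --env=Pyramids --num-envs=4 --run-id=pyramids_parallel
```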
## Files and Structure
```
├── Pyramids.onnx                # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx   # Training checkpoints
│   ├── configuration.yaml       # Training configuration
│   └── run_logs/                # Training metrics
└── results/
    ├── training_summary.json    # Training statistics
    └── tensorboard_logs/        # TensorBoard data
```
## Usage
### Loading the Model
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Engine side channel lets Python adjust time scale, resolution, etc.
channel = EngineConfigurationChannel()

# Launch the Pyramids executable and connect to it
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])
env.reset()

# When training with mlagents-learn, the policy network is created and
# loaded automatically; this low-level API is for custom evaluation loops.
```
### Training Command
```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```
### Resuming Training
```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```
### Inference
The trained `Pyramids.onnx` policy can be used directly in Unity builds (assign it to the agent's Behavior Parameters in the Inspector) or evaluated through the ML-Agents Python API.
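As a rough sketch for ad-hoc checks outside Unity, the exported graph can also be loaded with `onnxruntime`. Tensor names and shapes vary across ML-Agents versions, so they are queried from the session rather than hard-coded; the zero-filled feed assumes float inputs and is only a smoke test:
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("Pyramids.onnx")

# Discover tensor names and shapes instead of hard-coding them,
# since ML-Agents exports differ between releases
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape)
for out in session.get_outputs():
    print("output:", out.name, out.shape)

# Smoke test only: zero-filled inputs, symbolic dims treated as batch of 1
feed = {
    inp.name: np.zeros(
        [d if isinstance(d, int) else 1 for d in inp.shape],
        dtype=np.float32,
    )
    for inp in session.get_inputs()
}
print(session.run(None, feed))
```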
## Limitations and Considerations
1. **Environment Specific**: Trained specifically for Pyramids environment layout
2. **Visual Dependency**: Performance tied to visual observation quality
3. **Exploration Balance**: RND parameters may need tuning for different scenarios
4. **Computational Requirements**: Requires GPU for efficient training
5. **Generalization**: May not transfer well to significantly different navigation tasks
## Optimization Suggestions
For improved performance, consider:
- **Enable normalization**: `normalize: true`
- **Increase network capacity**: `hidden_units: 768`
- **Longer time horizon**: `time_horizon: 256`
- **Higher batch size**: `batch_size: 256`
- **More training steps**: `max_steps: 2000000`
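Mapped onto the trainer config, these suggestions look like the following (untested values to experiment with, not a validated recipe):
```yaml
hyperparameters:
  batch_size: 256
network_settings:
  normalize: true
  hidden_units: 768
max_steps: 2000000
time_horizon: 256
```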
## Applications
- **Game AI**: Intelligent NPC navigation in 3D games
- **Robotics Research**: Transfer learning for robot navigation
- **Pathfinding**: Advanced pathfinding algorithm development
- **Educational**: Demonstration of RL in complex 3D environments
## Ethical Considerations
This model represents a benign navigation task with no ethical concerns:
- **Content**: Abstract geometric environment
- **Purpose**: Educational and research applications
- **Safety**: No real-world safety implications
## System Requirements
### Training
- **OS**: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 8GB minimum, 16GB recommended
- **Storage**: 2GB for environment and model files
### Dependencies
```
mlagents>=0.28.0
mlagents-envs>=0.28.0
torch>=1.8.0
tensorboard
numpy
```
## Citation
If you use this model, please cite:
```bibtex
@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}
```
## References
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627
## Training Logs and Monitoring
Monitor training progress through:
- **TensorBoard**: Real-time training metrics
- **Console Output**: Episode rewards and statistics
- **Checkpoint Analysis**: Model performance over time
- **Success Rate Tracking**: Goal completion percentage
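ML-Agents writes TensorBoard summaries under the results directory, so live monitoring is a single command:
```bash
tensorboard --logdir results --port 6006
```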
---
*For optimal results, consider using the improved configuration with normalization enabled and increased network capacity.*