---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---
# PPO-Pyramids Unity ML-Agents Model

## Model Description

This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents. Pyramids is a sparse-reward 3D navigation and puzzle-solving task: the agent must press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that sits on top.

## Model Details

### Model Architecture
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Framework**: Unity ML-Agents with PyTorch backend
- **Policy Type**: Actor-Critic with shared feature extraction
- **Network Architecture**:
  - Hidden Units: 512 per layer
  - Number of Layers: 2
  - Activation: Swish (ML-Agents default)
  - Normalization: Disabled
  - Visual Encoding: `simple` CNN (only used when visual observations are present; the default Pyramids agent uses ray casts)

### Environment: Pyramids

The Pyramids environment is one of Unity ML-Agents' example environments featuring:
- **Objective**: Press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick on top
- **Setting**: Enclosed 3D rooms containing the button, immovable stone pyramids, and the spawned block pyramid
- **Reward Sparsity**: The extrinsic reward is sparse, which is why this model pairs PPO with an intrinsic RND signal
- **Parallelism**: The training scene replicates the area multiple times, all driven by a single policy

## Training Configuration

### PPO Hyperparameters
```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01                    # Entropy regularization
epsilon: 0.2                  # PPO clipping parameter
lambd: 0.95                   # GAE parameter (ML-Agents spells this key "lambd")
num_epoch: 3                  # Training epochs per update
learning_rate_schedule: linear
```

### Network Settings
```yaml
normalize: false              # Input normalization
hidden_units: 512            # Units per hidden layer
num_layers: 2                # Number of hidden layers
vis_encode_type: simple      # Visual encoder type
```

### Reward Structure
- **Extrinsic Rewards**:
  - Gamma: 0.99 (discount factor)
  - Strength: 1.0
  - Sparse reward (+2) for reaching the gold brick
  - Small per-step penalty (-0.001) to encourage efficiency

- **Intrinsic Rewards (RND)**:
  - Random Network Distillation for exploration
  - Gamma: 0.99
  - Strength: 0.01
  - Separate network: 64 units, 3 layers
  - Learning rate: 0.0001
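
In an ML-Agents trainer config, this reward structure maps to a `reward_signals` block; a sketch using the values above:

```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  rnd:
    gamma: 0.99
    strength: 0.01
    network_settings:
      hidden_units: 64
      num_layers: 3
    learning_rate: 0.0001
```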

### Training Process
- **Max Steps**: 1,000,000 training steps
- **Time Horizon**: 128 steps per trajectory
- **Checkpoints**: The 5 most recent checkpoints are kept
- **Summary Frequency**: Every 30,000 steps
- **Training Time**: Roughly 4-8 hours on a modern GPU
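
These map to top-level behavior keys in the same trainer config:

```yaml
max_steps: 1000000
time_horizon: 128
keep_checkpoints: 5
summary_freq: 30000
```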

## Observation Space

The agent receives:
- **Ray-Cast Observations**: Local ray casts detecting the switch, blocks, the gold brick, and walls
- **Switch State**: Whether the switch has been activated
- **Vector Size**: 148 dimensions in the default build
- **Camera Input**: None by default; `vis_encode_type: simple` only applies if visual observations are added

## Action Space

- **Action Type**: Discrete (single branch)
- **Actions**: Forward/backward movement and left/right rotation (yaw)
- **Note**: The default Pyramids agent does not use continuous actions or jumping
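
Both spaces can be verified at runtime with the low-level Python API; a minimal sketch (assumes a local Pyramids build):

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="Pyramids")  # path to your Pyramids build
env.reset()

behavior_name = list(env.behavior_specs.keys())[0]
spec = env.behavior_specs[behavior_name]

# Shape of each observation tensor and the action specification
print("Observations:", [obs.shape for obs in spec.observation_specs])
print("Actions:", spec.action_spec)

env.close()
```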

## Performance Metrics

### Expected Performance
- **Goal Reaching Success Rate**: Roughly 80-95% of episodes for a fully trained agent (indicative, not a measured benchmark)
- **Average Episode Length**: Shortens over training as the agent learns more direct routes to the button and brick
- **Training Convergence**: Stable improvement over the 1M training steps
- **Exploration Efficiency**: RND keeps exploration active early without overwhelming the extrinsic objective

### Key Metrics Tracked
- **Cumulative Reward**: Total reward per episode
- **Success Rate**: Percentage of episodes reaching goal
- **Episode Length**: Steps to complete episode
- **Policy Entropy**: Measure of action diversity
- **Value Function Accuracy**: Critic network performance

## Technical Implementation

### PPO Algorithm Features
- **Policy Clipping**: Prevents destructive policy updates (ε = 0.2)
- **Generalized Advantage Estimation**: GAE with λ = 0.95
- **Entropy Regularization**: Encourages exploration (β = 0.01)
- **Value Function Learning**: Shared network with policy
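
For reference, these settings parameterize the PPO clipped surrogate objective and GAE (with ε = 0.2, λ = 0.95, γ = 0.99):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

$$
\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$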

### Random Network Distillation (RND)
- **Purpose**: Intrinsic motivation for exploration
- **Implementation**: Separate predictor and target networks
- **Benefit**: Encourages visiting novel states
- **Balance**: Low strength (0.01) to avoid overwhelming extrinsic rewards
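
As a conceptual sketch (not ML-Agents' internal implementation), the RND mechanism fits in a few lines of PyTorch. The 64-unit, 3-layer networks mirror the settings above; the observation size is an assumed default:

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, hidden: int = 64, layers: int = 3, out_dim: int = 64) -> nn.Sequential:
    # Small MLP matching the RND network settings above (64 units, 3 layers)
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.SiLU()]  # Swish activation
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

obs_dim = 148                    # assumed default Pyramids vector observation size
target = mlp(obs_dim)            # fixed, randomly initialized network
predictor = mlp(obs_dim)         # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # Prediction error is large on novel states and shrinks on familiar ones,
    # so the squared error itself serves as the exploration bonus.
    error = ((predictor(obs) - target(obs)) ** 2).mean(dim=-1)
    optimizer.zero_grad()
    error.mean().backward()      # train the predictor on the same batch
    optimizer.step()
    return error.detach()
```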

### Unity ML-Agents Integration
- **Training Interface**: Python mlagents-learn command
- **Environment Communication**: Unity-Python API
- **Parallel Training**: Multiple environment instances
- **Real-time Monitoring**: TensorBoard integration
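
For example, metrics written by `mlagents-learn` (stored under `results/<run-id>` by default) can be viewed with:

```bash
tensorboard --logdir results
```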

## Files and Structure

```
├── Pyramids.onnx              # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx # Training checkpoints
│   ├── configuration.yaml     # Training configuration
│   └── run_logs/              # Training metrics
└── results/
    ├── training_summary.json  # Training statistics
    └── tensorboard_logs/      # TensorBoard data
```

## Usage

### Loading the Model
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Configure and connect to the environment
# (file_name should point to your local Pyramids executable/build)
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])

# When training or resuming with mlagents-learn, the model is loaded automatically
```

### Training Command
```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```

### Resuming Training

```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```

### Inference
```python
# The trained model can be used directly in Unity builds
# or through the ML-Agents Python API for evaluation
```
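
Below is a minimal evaluation sketch using the low-level Python API. Random actions stand in for the trained policy; driving the exported `Pyramids.onnx` from Python would additionally require an ONNX runtime such as `onnxruntime`:

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="Pyramids")  # path to your Pyramids build
env.reset()
behavior_name = list(env.behavior_specs.keys())[0]
spec = env.behavior_specs[behavior_name]

for episode in range(3):
    env.reset()
    done, episode_reward = False, 0.0
    while not done:
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        if len(terminal_steps) > 0:
            # Episode ended (tracks the first agent; the default scene has several areas)
            episode_reward += terminal_steps.reward[0]
            done = True
        else:
            episode_reward += decision_steps.reward[0]
            # Placeholder policy: sample random actions from the action spec
            action = spec.action_spec.random_action(len(decision_steps))
            env.set_actions(behavior_name, action)
            env.step()
    print(f"Episode {episode}: reward = {episode_reward:.2f}")

env.close()
```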

## Limitations and Considerations

1. **Environment Specific**: Trained specifically for Pyramids environment layout
2. **Visual Dependency**: Performance tied to visual observation quality
3. **Exploration Balance**: RND parameters may need tuning for different scenarios
4. **Computational Requirements**: Requires GPU for efficient training
5. **Generalization**: May not transfer well to significantly different navigation tasks

## Optimization Suggestions

For improved performance, consider:
- **Enable normalization**: `normalize: true`
- **Increase network capacity**: `hidden_units: 768`
- **Longer time horizon**: `time_horizon: 256`  
- **Higher batch size**: `batch_size: 256`
- **More training steps**: `max_steps: 2000000`
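
Combined, those changes would look roughly like the trainer config below (a sketch modeled on the standard Pyramids PPO config; verify keys against your ML-Agents version):

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 768
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    max_steps: 2000000
    time_horizon: 256
    summary_freq: 30000
    keep_checkpoints: 5
```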

## Applications

- **Game AI**: Intelligent NPC navigation in 3D games
- **Robotics Research**: Transfer learning for robot navigation
- **Pathfinding**: Advanced pathfinding algorithm development
- **Educational**: Demonstration of RL in complex 3D environments

## Ethical Considerations

This model represents a benign navigation task with no ethical concerns:
- **Content**: Abstract geometric environment
- **Purpose**: Educational and research applications
- **Safety**: No real-world safety implications

## System Requirements

### Training
- **OS**: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 8GB minimum, 16GB recommended
- **Storage**: 2GB for environment and model files

### Dependencies
```
mlagents>=0.28.0
torch>=1.8.0
tensorboard
numpy
```
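
For example:

```bash
pip install "mlagents>=0.28.0" "torch>=1.8.0" tensorboard numpy
```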

## Citation

If you use this model, please cite:

```bibtex
@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}
```

## References

- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627

## Training Logs and Monitoring

Monitor training progress through:
- **TensorBoard**: Real-time training metrics
- **Console Output**: Episode rewards and statistics  
- **Checkpoint Analysis**: Model performance over time
- **Success Rate Tracking**: Goal completion percentage

---

*For optimal results, consider using the improved configuration with normalization enabled and increased network capacity. 🏗️🎯*