---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---
# PPO-Pyramids Unity ML-Agents Model

## Model Description

This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents. Pyramids is a sparse-reward 3D navigation and puzzle-solving task: the agent must press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that sits on top.

## Model Details

### Model Architecture
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Framework**: Unity ML-Agents with PyTorch backend
- **Policy Type**: Actor-Critic with shared feature extraction
- **Network Architecture**:
  - Hidden Units: 512 per layer
  - Number of Layers: 2
  - Activation: Swish (ML-Agents default)
  - Normalization: Disabled
  - Visual Encoding: `simple` CNN (only used when visual observations are present; the default Pyramids agent uses ray casts)

### Environment: Pyramids

The Pyramids environment is one of Unity ML-Agents' example environments featuring:
- **Objective**: Press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick on top
- **Setting**: Enclosed 3D rooms containing the button, immovable stone pyramids, and the spawned block pyramid
- **Reward Sparsity**: The extrinsic reward is sparse, which is why this model pairs PPO with an intrinsic RND signal
- **Parallelism**: The training scene replicates the area multiple times, all driven by a single policy

## Training Configuration

### PPO Hyperparameters
```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01                    # Entropy regularization
epsilon: 0.2                  # PPO clipping parameter
lambd: 0.95                   # GAE parameter (ML-Agents spells this key "lambd")
num_epoch: 3                  # Training epochs per update
learning_rate_schedule: linear
```

### Network Settings
```yaml
normalize: false              # Input normalization
hidden_units: 512            # Units per hidden layer
num_layers: 2                # Number of hidden layers
vis_encode_type: simple      # Visual encoder type
```

### Reward Structure
- **Extrinsic Rewards**:
  - Gamma: 0.99 (discount factor)
  - Strength: 1.0
  - Sparse reward (+2) for reaching the gold brick
  - Small per-step penalty (-0.001) to encourage efficiency

- **Intrinsic Rewards (RND)**:
  - Random Network Distillation for exploration
  - Gamma: 0.99
  - Strength: 0.01
  - Separate network: 64 units, 3 layers
  - Learning rate: 0.0001
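
In an ML-Agents trainer config, this reward structure maps to a `reward_signals` block; a sketch using the values above:

```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  rnd:
    gamma: 0.99
    strength: 0.01
    network_settings:
      hidden_units: 64
      num_layers: 3
    learning_rate: 0.0001
```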

### Training Process
- **Max Steps**: 1,000,000 training steps
- **Time Horizon**: 128 steps per trajectory
- **Checkpoints**: The 5 most recent checkpoints are kept
- **Summary Frequency**: Every 30,000 steps
- **Training Time**: Roughly 4-8 hours on a modern GPU
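
These map to top-level behavior keys in the same trainer config:

```yaml
max_steps: 1000000
time_horizon: 128
keep_checkpoints: 5
summary_freq: 30000
```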

## Observation Space

The agent receives:
- **Ray-Cast Observations**: Local ray casts detecting the switch, blocks, the gold brick, and walls
- **Switch State**: Whether the switch has been activated
- **Vector Size**: 148 dimensions in the default build
- **Camera Input**: None by default; `vis_encode_type: simple` only applies if visual observations are added

## Action Space

- **Action Type**: Discrete (single branch)
- **Actions**: Forward/backward movement and left/right rotation (yaw)
- **Note**: The default Pyramids agent does not use continuous actions or jumping
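
Both spaces can be verified at runtime with the low-level Python API; a minimal sketch (assumes a local Pyramids build):

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="Pyramids")  # path to your Pyramids build
env.reset()

behavior_name = list(env.behavior_specs.keys())[0]
spec = env.behavior_specs[behavior_name]

# Shape of each observation tensor and the action specification
print("Observations:", [obs.shape for obs in spec.observation_specs])
print("Actions:", spec.action_spec)

env.close()
```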

## Performance Metrics

### Expected Performance
- **Goal Reaching Success Rate**: Roughly 80-95% of episodes for a fully trained agent (indicative, not a measured benchmark)
- **Average Episode Length**: Shortens over training as the agent learns more direct routes to the button and brick
- **Training Convergence**: Stable improvement over the 1M training steps
- **Exploration Efficiency**: RND keeps exploration active early without overwhelming the extrinsic objective

### Key Metrics Tracked
- **Cumulative Reward**: Total reward per episode
- **Success Rate**: Percentage of episodes reaching goal
- **Episode Length**: Steps to complete episode
- **Policy Entropy**: Measure of action diversity
- **Value Function Accuracy**: Critic network performance

## Technical Implementation

### PPO Algorithm Features
- **Policy Clipping**: Prevents destructive policy updates (ε = 0.2)
- **Generalized Advantage Estimation**: GAE with λ = 0.95
- **Entropy Regularization**: Encourages exploration (β = 0.01)
- **Value Function Learning**: Shared network with policy
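
For reference, these settings parameterize the PPO clipped surrogate objective and GAE (with ε = 0.2, λ = 0.95, γ = 0.99):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

$$
\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$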

### Random Network Distillation (RND)
- **Purpose**: Intrinsic motivation for exploration
- **Implementation**: Separate predictor and target networks
- **Benefit**: Encourages visiting novel states
- **Balance**: Low strength (0.01) to avoid overwhelming extrinsic rewards
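
As a conceptual sketch (not ML-Agents' internal implementation), the RND mechanism fits in a few lines of PyTorch. The 64-unit, 3-layer networks mirror the settings above; the observation size is an assumed default:

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, hidden: int = 64, layers: int = 3, out_dim: int = 64) -> nn.Sequential:
    # Small MLP matching the RND network settings above (64 units, 3 layers)
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.SiLU()]  # Swish activation
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

obs_dim = 148                    # assumed default Pyramids vector observation size
target = mlp(obs_dim)            # fixed, randomly initialized network
predictor = mlp(obs_dim)         # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # Prediction error is large on novel states and shrinks on familiar ones,
    # so the squared error itself serves as the exploration bonus.
    error = ((predictor(obs) - target(obs)) ** 2).mean(dim=-1)
    optimizer.zero_grad()
    error.mean().backward()      # train the predictor on the same batch
    optimizer.step()
    return error.detach()
```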

### Unity ML-Agents Integration
- **Training Interface**: Python mlagents-learn command
- **Environment Communication**: Unity-Python API
- **Parallel Training**: Multiple environment instances
- **Real-time Monitoring**: TensorBoard integration
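
For example, metrics written by `mlagents-learn` (stored under `results/<run-id>` by default) can be viewed with:

```bash
tensorboard --logdir results
```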

## Files and Structure

```
├── Pyramids.onnx              # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx # Training checkpoints
│   ├── configuration.yaml     # Training configuration
│   └── run_logs/              # Training metrics
└── results/
    ├── training_summary.json  # Training statistics
    └── tensorboard_logs/      # TensorBoard data
```

## Usage

### Loading the Model
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Configure and connect to the environment
# (file_name should point to your local Pyramids executable/build)
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])

# When training or resuming with mlagents-learn, the model is loaded automatically
```

### Training Command
```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```

### Resuming Training

```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```

### Inference
```python
# The trained model can be used directly in Unity builds
# or through the ML-Agents Python API for evaluation
```
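
Below is a minimal evaluation sketch using the low-level Python API. Random actions stand in for the trained policy; driving the exported `Pyramids.onnx` from Python would additionally require an ONNX runtime such as `onnxruntime`:

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="Pyramids")  # path to your Pyramids build
env.reset()
behavior_name = list(env.behavior_specs.keys())[0]
spec = env.behavior_specs[behavior_name]

for episode in range(3):
    env.reset()
    done, episode_reward = False, 0.0
    while not done:
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        if len(terminal_steps) > 0:
            # Episode ended (tracks the first agent; the default scene has several areas)
            episode_reward += terminal_steps.reward[0]
            done = True
        else:
            episode_reward += decision_steps.reward[0]
            # Placeholder policy: sample random actions from the action spec
            action = spec.action_spec.random_action(len(decision_steps))
            env.set_actions(behavior_name, action)
            env.step()
    print(f"Episode {episode}: reward = {episode_reward:.2f}")

env.close()
```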

## Limitations and Considerations

1. **Environment Specific**: Trained specifically for Pyramids environment layout
2. **Visual Dependency**: Performance tied to visual observation quality
3. **Exploration Balance**: RND parameters may need tuning for different scenarios
4. **Computational Requirements**: Requires GPU for efficient training
5. **Generalization**: May not transfer well to significantly different navigation tasks

## Optimization Suggestions

For improved performance, consider:
- **Enable normalization**: `normalize: true`
- **Increase network capacity**: `hidden_units: 768`
- **Longer time horizon**: `time_horizon: 256`  
- **Higher batch size**: `batch_size: 256`
- **More training steps**: `max_steps: 2000000`
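
Combined, those changes would look roughly like the trainer config below (a sketch modeled on the standard Pyramids PPO config; verify keys against your ML-Agents version):

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 768
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    max_steps: 2000000
    time_horizon: 256
    summary_freq: 30000
    keep_checkpoints: 5
```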

## Applications

- **Game AI**: Intelligent NPC navigation in 3D games
- **Robotics Research**: Transfer learning for robot navigation
- **Pathfinding**: Advanced pathfinding algorithm development
- **Educational**: Demonstration of RL in complex 3D environments

## Ethical Considerations

This model represents a benign navigation task with no ethical concerns:
- **Content**: Abstract geometric environment
- **Purpose**: Educational and research applications
- **Safety**: No real-world safety implications

## System Requirements

### Training
- **OS**: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 8GB minimum, 16GB recommended
- **Storage**: 2GB for environment and model files

### Dependencies
```
mlagents>=0.28.0
torch>=1.8.0
tensorboard
numpy
```
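
For example:

```bash
pip install "mlagents>=0.28.0" "torch>=1.8.0" tensorboard numpy
```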

## Citation

If you use this model, please cite:

```bibtex
@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}
```

## References

- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627

## Training Logs and Monitoring

Monitor training progress through:
- **TensorBoard**: Real-time training metrics
- **Console Output**: Episode rewards and statistics  
- **Checkpoint Analysis**: Model performance over time
- **Success Rate Tracking**: Goal completion percentage

---

*For optimal results, consider using the improved configuration with normalization enabled and increased network capacity. 🏗️🎯*