---
library_name: ml-agents
tags:
- Huggy
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Huggy
---

# **ppo** Agent playing **Huggy**

This is a trained model of a **ppo** agent playing **Huggy** using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

# Huggy PPO Agent - Training Documentation

## Model Overview

This model is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit in **Huggy**, a custom Unity environment. The agent was trained for 2 million steps.

## Training Environment

- **Environment**: Unity ML-Agents custom environment "Huggy"
- **ML-Agents Version**: 1.2.0.dev0
- **ML-Agents Envs**: 1.2.0.dev0
- **Communicator API**: 1.5.0
- **PyTorch Version**: 2.7.1+cu126
- **Unity Package Version**: 2.2.1-exp.1

## Training Configuration

### PPO Hyperparameters
- **Batch Size**: 2,048
- **Buffer Size**: 20,480
- **Learning Rate**: 0.0003 (linear schedule)
- **Beta (entropy regularization)**: 0.005 (linear schedule)
- **Epsilon (PPO clip parameter)**: 0.2 (linear schedule)
- **Lambda (GAE parameter)**: 0.95
- **Number of Epochs**: 3
- **Shared Critic**: False

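To illustrate what the epsilon clip parameter controls, here is a minimal sketch of the standard PPO clipped surrogate term for a single sample. It follows the textbook formulation rather than the ML-Agents source, and the function name `clipped_surrogate` is an assumption made for this example.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# A policy that raised an action's probability by 50% (ratio 1.5) only gets
# credit up to the clip boundary of 1 + epsilon when the advantage is positive.
print(clipped_surrogate(ratio=1.5, advantage=1.0))   # 1.2
# With a negative advantage the unclipped (more pessimistic) term is kept.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))  # -1.5
```

With epsilon at 0.2, a single update has no incentive to move the probability ratio beyond the 0.8 to 1.2 range in the direction the advantage favors.
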
### Network Architecture
- **Normalization**: Enabled
- **Hidden Units**: 512
- **Number of Layers**: 3
- **Visual Encoding Type**: Simple
- **Memory**: None
- **Goal Conditioning Type**: Hyper
- **Deterministic**: False

### Reward Configuration
- **Reward Type**: Extrinsic
- **Gamma (discount factor)**: 0.995
- **Reward Strength**: 1.0
- **Reward Network Hidden Units**: 128
- **Reward Network Layers**: 2

### Training Parameters
- **Maximum Steps**: 2,000,000
- **Time Horizon**: 1,000
- **Summary Frequency**: 50,000 steps
- **Checkpoint Interval**: 200,000 steps
- **Keep Checkpoints**: 15
- **Threaded Training**: False

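The settings above interact: the buffer size and batch size determine how many gradient steps follow each buffer fill, and a linear schedule shrinks a hyperparameter as training progresses. The sketch below works through that arithmetic. The helper names and the `min_value` floor of 0.0 are assumptions for illustration and are not taken from the training logs.

```python
# Values copied from the configuration listed above.
BATCH_SIZE = 2_048
BUFFER_SIZE = 20_480
NUM_EPOCHS = 3
MAX_STEPS = 2_000_000
LEARNING_RATE = 3e-4


def gradient_steps_per_update(buffer_size: int, batch_size: int, num_epochs: int) -> int:
    """Minibatch gradient steps taken each time the buffer fills."""
    return (buffer_size // batch_size) * num_epochs


def linear_schedule(initial: float, step: int, max_steps: int, min_value: float = 0.0) -> float:
    """Interpolate linearly from `initial` at step 0 to `min_value` at max_steps."""
    progress = min(step, max_steps) / max_steps
    return (initial - min_value) * (1.0 - progress) + min_value


print(gradient_steps_per_update(BUFFER_SIZE, BATCH_SIZE, NUM_EPOCHS))  # 30
print(MAX_STEPS // BUFFER_SIZE)  # 97 full buffers over the whole run
print(linear_schedule(LEARNING_RATE, 1_000_000, MAX_STEPS))  # half of the initial rate
```

Under these assumptions the run performs roughly 97 policy updates of 30 gradient steps each, and the learning rate at the midpoint is half its starting value.
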
## Training Performance

### Performance Progression

Mean reward rose quickly during the first 200k steps and then plateaued. Values are reported as mean ± standard deviation.

**Early Training (0-200k steps):**
- Step 50k: Mean Reward = 1.840 ± 0.925
- Step 100k: Mean Reward = 2.747 ± 1.096
- Step 150k: Mean Reward = 3.031 ± 1.174
- Step 200k: Mean Reward = 3.538 ± 1.370

**Mid Training (200k-1M steps):**
- Mean reward stabilized around 3.6-3.9
- Peak mean reward at 500k steps: 3.873 ± 1.783

**Late Training (1M-2M steps):**
- Mean reward stayed around 3.5-3.8
- Final mean reward at 2M steps: 3.718 ± 2.132

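The reported figures can be compared directly. The following sketch uses only the numbers listed above to contrast the gain made before and after 200k steps and to express the spread relative to the mean; the variable and function names are illustrative.

```python
# (step, mean reward, standard deviation) as reported above.
REPORTED = [
    (50_000, 1.840, 0.925),
    (100_000, 2.747, 1.096),
    (150_000, 3.031, 1.174),
    (200_000, 3.538, 1.370),
    (500_000, 3.873, 1.783),
    (2_000_000, 3.718, 2.132),
]


def coefficient_of_variation(mean: float, std: float) -> float:
    """Standard deviation expressed as a fraction of the mean."""
    return std / mean


for step, mean, std in REPORTED:
    cv = coefficient_of_variation(mean, std)
    print(f"{step:>9,} steps: mean={mean:.3f}  std/mean={cv:.2f}")

mean_50k = REPORTED[0][1]
mean_200k = REPORTED[3][1]
mean_final = REPORTED[-1][1]
early_gain = mean_200k - mean_50k    # gain between 50k and 200k steps
late_gain = mean_final - mean_200k   # gain between 200k and 2M steps
print(f"early gain={early_gain:.3f}, late gain={late_gain:.3f}")
```

The early gain is about 1.70 reward, while the remaining 1.8M steps add about 0.18.
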
### Key Performance Metrics

- **Training Duration**: 2,350.439 seconds (~39 minutes)
- **Final Mean Reward**: 3.718
- **Final Standard Deviation**: 2.132
- **Peak Mean Reward**: 3.873 (at 500k steps)
- **Lowest Standard Deviation**: 0.925 (at 50k steps)

## Training Characteristics

### Learning Curve Analysis
1. **Rapid Initial Learning**: Mean reward nearly doubled in the first 200k steps (1.84 → 3.54)
2. **Plateau Phase**: Mean reward stayed in the 3.5-3.9 range from 200k to 2M steps
3. **Variance Increase**: Standard deviation rose from 0.925 at 50k steps to 2.132 at 2M steps. This may reflect more varied episode outcomes, but the logs alone do not identify the cause

### Model Checkpoints
ONNX checkpoints were exported roughly every 200k steps. The number in each filename is the step count at export:
- Huggy-199933.onnx
- Huggy-399938.onnx
- Huggy-599920.onnx
- Huggy-799966.onnx
- Huggy-999748.onnx
- Huggy-1199265.onnx
- Huggy-1399932.onnx
- Huggy-1599985.onnx
- Huggy-1799997.onnx
- Huggy-1999614.onnx
- **Final Model**: Huggy-2000364.onnx

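Because the step count is embedded in each filename, the spacing between checkpoints can be checked with a few lines of code. This is a small sketch that assumes the `Huggy-<steps>.onnx` naming pattern shown in the list; `checkpoint_step` is a helper name chosen for this example.

```python
import re

CHECKPOINTS = [
    "Huggy-199933.onnx", "Huggy-399938.onnx", "Huggy-599920.onnx",
    "Huggy-799966.onnx", "Huggy-999748.onnx", "Huggy-1199265.onnx",
    "Huggy-1399932.onnx", "Huggy-1599985.onnx", "Huggy-1799997.onnx",
    "Huggy-1999614.onnx", "Huggy-2000364.onnx",
]


def checkpoint_step(filename: str) -> int:
    """Extract the training step from a name like 'Huggy-199933.onnx'."""
    match = re.fullmatch(r"Huggy-(\d+)\.onnx", filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {filename}")
    return int(match.group(1))


steps = [checkpoint_step(name) for name in CHECKPOINTS]
# Differences between consecutive checkpoints, in training steps.
intervals = [later - earlier for earlier, later in zip(steps, steps[1:])]
print(intervals)
```

The regular intervals all fall within about 700 steps of 200,000, and the final model was exported 750 steps after the last regular checkpoint.
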
## Technical Implementation

### Training Framework
- Unity ML-Agents with PPO algorithm
- Custom Unity environment integration
- ONNX model export for deployment
- Training monitoring through summaries written every 50,000 steps

### Model Architecture Details
- Multi-layer perceptron with 3 hidden layers
- 512 hidden units per layer
- Input normalization enabled
- Separate actor and critic networks (shared_critic = False)
- Hypernetwork goal conditioning

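To give a sense of scale for this architecture, the sketch below counts the weights and biases in a stack of fully connected hidden layers. The observation size of 64 is a made-up placeholder, since the real Huggy observation size is not stated here, and the count excludes normalization statistics and output heads.

```python
def mlp_parameter_count(obs_size: int, hidden_units: int = 512, num_layers: int = 3) -> int:
    """Weights plus biases for `num_layers` fully connected hidden layers."""
    total = 0
    in_features = obs_size
    for _ in range(num_layers):
        total += in_features * hidden_units + hidden_units
        in_features = hidden_units
    return total


# obs_size=64 is a hypothetical value used only for illustration.
print(mlp_parameter_count(obs_size=64))  # 558592
```

With separate actor and critic networks, each network carries its own copy of such a stack, so the total is roughly double this figure.
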
### Reward Signal Processing
- Single extrinsic reward signal, so rewards come directly from the environment
- Discount factor of 0.995, which keeps rewards hundreds of steps ahead relevant
- Reward-signal network settings of 2 layers and 128 units, as listed in the configuration

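The effect of a 0.995 discount factor can be made concrete with a short calculation. The sketch below computes a discounted return and the commonly used effective horizon 1 / (1 - gamma); the function names are chosen for this example.

```python
GAMMA = 0.995


def discounted_return(rewards, gamma: float = GAMMA) -> float:
    """Sum of gamma**t * reward_t, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma: float = GAMMA) -> float:
    """Rough number of steps over which future rewards still matter."""
    return 1.0 / (1.0 - gamma)


print(effective_horizon())            # about 200 steps
print(GAMMA ** 200)                   # weight of a reward 200 steps ahead
print(discounted_return([1.0] * 10))  # ten unit rewards, lightly discounted
```

A reward 200 steps in the future still carries a weight of roughly 0.37, which is why this setting suits tasks where the payoff arrives late in an episode.
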
## Performance Insights

### Strengths
- Steady improvement at every reported point during the first 200k steps
- Stable final performance around 3.7 mean reward
- Successful completion of 2M training steps
- Regular checkpoint generation for model versioning

### Observations
- Standard deviation increased over training, which may mean episode outcomes became more varied
- Mean reward plateaued after roughly 200k-500k steps, so most of the improvement came early in the 2M-step budget
- Mean reward did not degrade noticeably in late training (3.718 at 2M steps versus the 3.873 peak)

### Training Efficiency
- **Steps per Second**: ~851 steps/second on average (2,000,000 steps in 2,350.439 seconds)
- **Episodes per Checkpoint**: Approximately 200-250 episodes per checkpoint
- **Memory-Related Settings**: 20,480 buffer size and 1,000 time horizon

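The throughput figure follows directly from the reported step count and duration, as this short check shows (constant names are illustrative):

```python
TOTAL_STEPS = 2_000_000
DURATION_SECONDS = 2_350.439

steps_per_second = TOTAL_STEPS / DURATION_SECONDS
duration_minutes = DURATION_SECONDS / 60

print(f"{steps_per_second:.1f} steps/second over {duration_minutes:.1f} minutes")
```
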
In summary, PPO training in the Huggy environment completed all 2M steps and ended at a mean reward of about 3.7, with most of the improvement occurring in the first 200k steps.

# Huggy PPO Agent - Usage Guide

## Prerequisites

Before using the Huggy model, install the following:

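```bash
# Install Unity ML-Agents
# The training run reported 1.2.0.dev0, a development build. If this pin is
# not available from your package index, install ML-Agents from the source
# repository instead.
pip install mlagents==1.2.0

# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime
```
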
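After installing, it can help to confirm which versions actually ended up in the environment. The following sketch uses the standard-library `importlib.metadata` module; the helper name `installed_versions` is an assumption for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install commands above.
print(installed_versions(["mlagents", "torch", "onnx", "onnxruntime"]))
```

A value of `None` means the package is not installed in the active Python environment.
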
## Model Files

After training, you'll have these key files:
- **Huggy.onnx** - The trained model (final version)
- **Huggy-2000364.onnx** - Final checkpoint model
- **config.yaml** - Training configuration file
- **training logs** - Performance metrics and tensorboard data

## Loading and Using the Model

### Method 1: Using ML-Agents Python API

```python
from mlagents_envs.environment import UnityEnvironment

# Load the Unity environment
env = UnityEnvironment(file_name="path/to/your/huggy_environment")

# Reset the environment (behavior specs are only available after a reset)
env.reset()

# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]  # the Huggy behavior, possibly with a team suffix
spec = env.behavior_specs[behavior_name]

print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")

# Release the environment when done
env.close()
```

### Method 2: Using ONNX Runtime for Inference

```python
import onnxruntime as ort
import numpy as np

# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)

# Get model input/output info
# Assumption: the first input is the observation and the first output holds
# per-action probabilities. Exported models can expose several inputs and
# outputs, so print the names and pick the right ones for your model.
print([node.name for node in ort_session.get_inputs()])
print([node.name for node in ort_session.get_outputs()])
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name


def predict_action(observation):
    """
    Predict action using the trained model
    """
    # Prepare observation as a float32 batch of size 1
    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

    # Run inference and take the row for our single observation
    action_probs = ort_session.run([output_name], {input_name: obs_input})[0][0]

    # Sample action from probabilities or take deterministic action
    action = int(np.argmax(action_probs))  # Deterministic
    # OR: action = np.random.choice(len(action_probs), p=action_probs)  # Stochastic

    return action
```

### Method 3: Running Trained Agent in Unity

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np

# Initialize environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Get behavior name (behavior specs are only populated after a reset)
env.reset()
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]

# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    episode_reward = 0.0
    step_count = 0

    while len(decision_steps) > 0:
        # Get observations, shape (num_agents, obs_size)
        observations = decision_steps.obs[0]

        # Predict actions using trained model
        # Assumes output 0 holds per-action scores for one discrete branch
        actions = []
        for obs in observations:
            obs_input = obs.reshape(1, -1).astype(np.float32)
            action_probs = ort_session.run(None, {"obs_0": obs_input})
            actions.append(int(np.argmax(action_probs[0][0])))

        # Send actions to environment: shape (num_agents, num_branches)
        discrete_actions = np.array(actions, dtype=np.int32).reshape(-1, 1)
        action_tuple = ActionTuple(discrete=discrete_actions)
        env.set_actions(behavior_name, action_tuple)

        # Step environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        step_count += 1

        # Track rewards for the first agent
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]

    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()
```

## Evaluation and Testing

### Performance Evaluation Script

```python
import numpy as np
from mlagents_envs.base_env import ActionTuple


def evaluate_model(env, model_session, num_episodes=100):
    """
    Evaluate the trained model performance
    """
    results = {
        'rewards': [],
        'episode_lengths': [],
    }

    # Behavior specs are only populated after a reset
    env.reset()
    behavior_name = list(env.behavior_specs.keys())[0]

    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        episode_reward = 0.0
        episode_length = 0

        while len(decision_steps) > 0:
            # Get actions from model
            # Assumes output 0 holds per-action scores for one discrete branch
            observations = decision_steps.obs[0]
            actions = []

            for obs in observations:
                obs_input = obs.reshape(1, -1).astype(np.float32)
                action_probs = model_session.run(None, {"obs_0": obs_input})
                action = int(np.argmax(action_probs[0][0]))  # Deterministic policy
                actions.append(action)

            # Step environment; discrete actions have shape (num_agents, num_branches)
            discrete_actions = np.array(actions, dtype=np.int32).reshape(-1, 1)
            action_tuple = ActionTuple(discrete=discrete_actions)
            env.set_actions(behavior_name, action_tuple)
            env.step()

            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1

            # Accumulate the first agent's reward and check for termination
            if len(terminal_steps) > 0:
                episode_reward += terminal_steps.reward[0]
                break
            if len(decision_steps) > 0:
                episode_reward += decision_steps.reward[0]

        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)

    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])

    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")

    return results
```

## Deployment Options

### Option 1: Unity Standalone Build
1. Build your Unity environment with the trained ONNX model assigned to the agent
2. At runtime the agent runs inference from the embedded ONNX file, so no Python process is needed
3. Deploy as a standalone executable

### Option 2: Python Integration
```python
# For integration into larger Python applications
import numpy as np
import onnxruntime as ort


class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def _action_probs(self, observation):
        """Run the model on one observation (assumes output 0 holds action probabilities)"""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        return self.session.run(None, {self.input_name: obs_input})[0][0]

    def act(self, observation):
        """Get action from observation"""
        return int(np.argmax(self._action_probs(observation)))

    def act_stochastic(self, observation):
        """Get stochastic action from observation"""
        action_probs = self._action_probs(observation)
        return int(np.random.choice(len(action_probs), p=action_probs))


# Usage
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
# current_observation is a 1-D array of observation values from the environment
action = agent.act(current_observation)
```

### Option 3: Web Deployment
```python
# For web applications using Flask
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")


@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    observation = np.array(data['observation'], dtype=np.float32)

    # Assumes output 0 holds per-action probabilities for a single observation
    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})[0][0]
    action = int(np.argmax(action_probs))

    return jsonify({'action': action, 'confidence': float(np.max(action_probs))})


if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True)
```

## Troubleshooting

### Common Issues

1. **ONNX Model Loading Errors**
   - Ensure ONNX runtime version compatibility
   - Check model file path and permissions

2. **Unity Environment Connection**
   - Verify Unity environment executable path
   - Check port availability (5004 when connecting to the Unity Editor, 5005 by default for built executables)

3. **Observation Shape Mismatches**
   - Ensure observation preprocessing matches training
   - Check input normalization requirements

4. **Performance Issues**
   - Use deterministic policy for consistent results
   - Consider batch inference for multiple agents

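For loading errors and shape mismatches, a useful first step is to list what the model actually expects. The sketch below takes an already created ONNX Runtime session and reports its input and output names and shapes, then compares an expected shape against an actual one. The helper names `describe_io` and `shapes_compatible` are assumptions for this example; `get_inputs()` and `get_outputs()` are the ONNX Runtime session methods already used elsewhere in this guide.

```python
def describe_io(session):
    """Collect input and output names with their declared shapes."""
    return {
        "inputs": {node.name: node.shape for node in session.get_inputs()},
        "outputs": {node.name: node.shape for node in session.get_outputs()},
    }


def shapes_compatible(expected, actual):
    """Compare shapes; non-integer (dynamic) dimensions in `expected` match anything."""
    if len(expected) != len(actual):
        return False
    return all(
        not isinstance(exp_dim, int) or exp_dim == act_dim
        for exp_dim, act_dim in zip(expected, actual)
    )
```

A typical use is `info = describe_io(ort.InferenceSession("results/Huggy2/Huggy.onnx"))`, followed by `shapes_compatible(info["inputs"]["obs_0"], observation.shape)` before calling `run`. The input name `obs_0` is the one used in this guide's examples and should be confirmed against the printed names.
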
### Performance Optimization

```python
import numpy as np


# Batch processing for multiple agents
def batch_predict(model_session, observations):
    """Process multiple observations at once"""
    # Stack observations into a float32 array of shape (num_agents, obs_size)
    batch_obs = np.array(observations, dtype=np.float32)
    # Assumes output 0 holds per-action scores, one row per observation
    action_probs = model_session.run(None, {"obs_0": batch_obs})
    actions = np.argmax(action_probs[0], axis=1)
    return actions
```

This guide covers loading, evaluating, and deploying the trained Huggy PPO agent, from local testing to serving predictions over HTTP.