---
library_name: ml-agents
tags:
- Huggy
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Huggy
---
# **ppo** Agent playing **Huggy**
This is a trained model of a **ppo** agent playing **Huggy**
using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).
# Huggy PPO Agent - Training Documentation
## Model Overview
This model is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit in **Huggy**, a custom Unity environment. The agent was trained for 2 million steps.
## Training Environment
- **Environment**: Unity ML-Agents custom environment "Huggy"
- **ML-Agents Version**: 1.2.0.dev0
- **ML-Agents Envs**: 1.2.0.dev0
- **Communicator API**: 1.5.0
- **PyTorch Version**: 2.7.1+cu126
- **Unity Package Version**: 2.2.1-exp.1
## Training Configuration
### PPO Hyperparameters
- **Batch Size**: 2,048
- **Buffer Size**: 20,480
- **Learning Rate**: 0.0003 (linear schedule)
- **Beta (entropy regularization)**: 0.005 (linear schedule)
- **Epsilon (PPO clip parameter)**: 0.2 (linear schedule)
- **Lambda (GAE parameter)**: 0.95
- **Number of Epochs**: 3
- **Shared Critic**: False
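
The "linear schedule" entries mean the value is annealed as training progresses. A minimal sketch of linear decay, assuming the value falls in a straight line from its initial setting toward a floor at the maximum step count. The function name `linear_decay` and the `end_value` floor are illustrative and are not taken from the training logs or the ML-Agents source:

```python
def linear_decay(initial, step, max_steps, end_value=0.0):
    """Linearly interpolate from `initial` at step 0 to `end_value` at `max_steps`."""
    # Clamp progress to [0, 1] so steps past max_steps stay at the floor.
    progress = min(max(step / max_steps, 0.0), 1.0)
    return initial + (end_value - initial) * progress


MAX_STEPS = 2_000_000  # "Maximum Steps" reported for this run

for step in (0, 500_000, 1_000_000, 2_000_000):
    lr = linear_decay(0.0003, step, MAX_STEPS)
    print(f"step {step:>9,}: learning rate = {lr:.6f}")
```

The same helper can be applied to beta (0.005) and epsilon (0.2); the floor each one decays toward is an assumption here and should be checked against the ML-Agents version in use.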
### Network Architecture
- **Normalization**: Enabled
- **Hidden Units**: 512
- **Number of Layers**: 3
- **Visual Encoding Type**: Simple
- **Memory**: None
- **Goal Conditioning Type**: Hyper
- **Deterministic**: False
### Reward Configuration
- **Reward Type**: Extrinsic
- **Gamma (discount factor)**: 0.995
- **Reward Strength**: 1.0
- **Reward Network Hidden Units**: 128
- **Reward Network Layers**: 2
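
A minimal sketch of how the discount factor (gamma = 0.995) and the GAE lambda (0.95) from the hyperparameters combine into advantage estimates. This is the textbook generalized advantage estimation recursion, not code extracted from ML-Agents. The function name `gae_advantages` and the sample numbers are illustrative, and the sketch treats its input as one trajectory segment with no episode boundary inside it:

```python
def gae_advantages(rewards, values, last_value, gamma=0.995, lam=0.95):
    """Compute GAE advantages for one trajectory segment.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future term.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Illustrative numbers only, not taken from the Huggy run.
sample = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7], last_value=0.0)
print([round(a, 4) for a in sample])
```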
### Training Parameters
- **Maximum Steps**: 2,000,000
- **Time Horizon**: 1,000
- **Summary Frequency**: 50,000 steps
- **Checkpoint Interval**: 200,000 steps
- **Keep Checkpoints**: 15
- **Threaded Training**: False
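
The values listed in this section can be cross-checked with a few lines of arithmetic. The sketch below collects them in a plain dictionary (the key names are chosen for illustration and are not claimed to match the original configuration file) and derives how many mini-batches fit in the buffer and how many checkpoints and summaries a full run should produce. The update count assumes one pass over every mini-batch per epoch:

```python
settings = {
    "batch_size": 2048,
    "buffer_size": 20480,
    "num_epoch": 3,
    "max_steps": 2_000_000,
    "summary_freq": 50_000,
    "checkpoint_interval": 200_000,
    "keep_checkpoints": 15,
}

# How many full mini-batches one buffer holds, and whether anything is left over.
batches_per_buffer, remainder = divmod(settings["buffer_size"], settings["batch_size"])
# Assumption: each epoch iterates once over all mini-batches in the buffer.
updates_per_buffer = batches_per_buffer * settings["num_epoch"]
expected_checkpoints = settings["max_steps"] // settings["checkpoint_interval"]
expected_summaries = settings["max_steps"] // settings["summary_freq"]

print(f"Mini-batches per buffer: {batches_per_buffer} (remainder {remainder})")
print(f"Gradient updates per buffer: {updates_per_buffer}")
print(f"Expected interval checkpoints: {expected_checkpoints}")
print(f"Expected summaries: {expected_summaries}")
print(f"All checkpoints retained: {expected_checkpoints <= settings['keep_checkpoints']}")
```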
## Training Performance
### Performance Progression
The agent showed steady improvement throughout training:
**Early Training (0-200k steps):**
- Step 50k: Mean Reward = 1.840 ± 0.925
- Step 100k: Mean Reward = 2.747 ± 1.096
- Step 150k: Mean Reward = 3.031 ± 1.174
- Step 200k: Mean Reward = 3.538 ± 1.370
**Mid Training (200k-1M steps):**
- Performance stabilized around 3.6-3.9 mean reward
- Peak performance at 500k steps: 3.873 ± 1.783
**Late Training (1M-2M steps):**
- Consistent performance around 3.5-3.8 mean reward
- Final performance at 2M steps: 3.718 ± 2.132
### Key Performance Metrics
- **Training Duration**: 2,350.439 seconds (~39 minutes)
- **Final Mean Reward**: 3.718
- **Final Standard Deviation**: 2.132
- **Peak Mean Reward**: 3.873 (at 500k steps)
- **Lowest Standard Deviation**: 0.925 (at 50k steps)
## Training Characteristics
### Learning Curve Analysis
1. **Rapid Initial Learning**: Significant improvement in first 200k steps (1.84 → 3.54)
2. **Plateau Phase**: Performance stabilized between 200k-2M steps
3. **Variance Increase**: Standard deviation rose from 0.925 at 50k steps to 2.132 at 2M steps, which may reflect more varied behavior or more variable episode outcomes
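
The trends in this list can be recomputed from the reported figures. A small sketch using only the mean rewards and standard deviations quoted in this document; the variable names are illustrative:

```python
# (step, mean reward, standard deviation) as reported in this document
reported = [
    (50_000, 1.840, 0.925),
    (100_000, 2.747, 1.096),
    (150_000, 3.031, 1.174),
    (200_000, 3.538, 1.370),
    (500_000, 3.873, 1.783),
    (2_000_000, 3.718, 2.132),
]

first_mean = reported[0][1]
for step, mean, std in reported:
    gain = (mean - first_mean) / first_mean * 100  # percent gain over the 50k value
    ratio = std / mean                             # spread relative to the mean
    print(f"{step:>9,} steps: mean {mean:.3f}, gain {gain:6.1f}%, std/mean {ratio:.2f}")
```

Between 50k and 200k steps the mean reward rises by about 92%, while the ratio of standard deviation to mean grows from roughly 0.50 at 50k steps to roughly 0.57 at 2M steps.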
### Model Checkpoints
Regular ONNX model exports were created every 200k steps:
- Huggy-199933.onnx
- Huggy-399938.onnx
- Huggy-599920.onnx
- Huggy-799966.onnx
- Huggy-999748.onnx
- Huggy-1199265.onnx
- Huggy-1399932.onnx
- Huggy-1599985.onnx
- Huggy-1799997.onnx
- Huggy-1999614.onnx
- **Final Model**: Huggy-2000364.onnx
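
The number in each filename is the training step at which the checkpoint was written. A short sketch that parses the step counts from the names listed above and prints the spacing between consecutive checkpoints; the helper name `checkpoint_step` is illustrative:

```python
import re

checkpoints = [
    "Huggy-199933.onnx", "Huggy-399938.onnx", "Huggy-599920.onnx",
    "Huggy-799966.onnx", "Huggy-999748.onnx", "Huggy-1199265.onnx",
    "Huggy-1399932.onnx", "Huggy-1599985.onnx", "Huggy-1799997.onnx",
    "Huggy-1999614.onnx", "Huggy-2000364.onnx",
]


def checkpoint_step(filename):
    """Extract the training step from a name like 'Huggy-199933.onnx'."""
    match = re.fullmatch(r"Huggy-(\d+)\.onnx", filename)
    if match is None:
        raise ValueError(f"Unexpected checkpoint name: {filename}")
    return int(match.group(1))


steps = sorted(checkpoint_step(name) for name in checkpoints)
intervals = [later - earlier for earlier, later in zip(steps, steps[1:])]

for step, interval in zip(steps[1:], intervals):
    print(f"checkpoint at step {step:>9,} (+{interval:,} since previous)")
```

The interval checkpoints land within a few hundred steps of each 200k boundary, and the final model is written 750 steps after the last interval checkpoint.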
## Technical Implementation
### Training Framework
- Unity ML-Agents with PPO algorithm
- Custom Unity environment integration
- ONNX model export for deployment
- Real-time training monitoring
### Model Architecture Details
- Multi-layer perceptron with 3 hidden layers
- 512 hidden units per layer
- Input normalization enabled
- Separate actor-critic networks (shared_critic = False)
- Hypernetwork goal conditioning
### Reward Signal Processing
- Single extrinsic reward signal
- Discount factor of 0.995 for long-term planning
- Dedicated reward network with 2 layers and 128 units
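
To make "long-term planning" concrete, the discount factor can be turned into an approximate horizon. A minimal sketch using the common rule of thumb that the effective horizon is 1 / (1 - gamma); this is a heuristic, not a quantity reported by the training run:

```python
import math

GAMMA = 0.995  # discount factor from the reward configuration

# Rule-of-thumb horizon: sum of the geometric series 1 + gamma + gamma^2 + ...
effective_horizon = 1.0 / (1.0 - GAMMA)
# Number of steps after which a future reward is weighted by one half.
half_life = math.log(0.5) / math.log(GAMMA)

print(f"Effective horizon: about {effective_horizon:.0f} steps")
print(f"Half-life of a future reward: about {half_life:.0f} steps")
for delay in (10, 100, 1000):
    print(f"Reward {delay:>4} steps ahead is weighted by {GAMMA ** delay:.4f}")
```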
## Performance Insights
### Strengths
- Consistent learning progression
- Stable final performance around 3.7 mean reward
- Successful completion of 2M training steps
- Regular checkpoint generation for model versioning
### Observations
- Standard deviation increased over training, which may mean the agent learned more varied strategies or that episode outcomes became more variable
- The plateau after 200k steps suggests most of the learning happened early and that training beyond that point brought little further gain in mean reward
- The agent maintained stable performance without significant degradation
### Training Efficiency
- **Steps per Second**: ~851 steps/second average
- **Episodes per Checkpoint**: Approximately 200-250 episodes per checkpoint
- **Buffer Settings**: 20,480 buffer size and 1,000 time horizon
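
The throughput figure follows directly from the reported duration and step count. A quick check, using the 2,000,000 maximum steps and the 2,350.439 second duration quoted in this document:

```python
TOTAL_STEPS = 2_000_000       # maximum training steps
DURATION_SECONDS = 2350.439   # reported training duration

steps_per_second = TOTAL_STEPS / DURATION_SECONDS
duration_minutes = DURATION_SECONDS / 60

print(f"Throughput: {steps_per_second:.1f} steps/second")
print(f"Duration: {duration_minutes:.1f} minutes")
```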
Overall, the run completed all 2M steps: mean reward rose quickly during the first 200k steps and then stayed between roughly 3.5 and 3.9 for the rest of training.
# Huggy PPO Agent - Usage Guide
## Prerequisites
Before using the Huggy model, ensure you have the following installed:
```bash
# Install Unity ML-Agents (training used 1.2.0.dev0; adjust the pin if this release is not available for your setup)
pip install mlagents==1.2.0
# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime
```
## Model Files
After training, you'll have these key files:
- **Huggy.onnx** - The trained model (final version)
- **Huggy-2000364.onnx** - Final checkpoint model
- **config.yaml** - Training configuration file
- **training logs** - Performance metrics and tensorboard data
## Loading and Using the Model
### Method 1: Using ML-Agents Python API
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import numpy as np
# Load the Unity environment
env = UnityEnvironment(file_name="path/to/your/huggy_environment")
# Reset the environment
env.reset()
# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0] # "Huggy"
spec = env.behavior_specs[behavior_name]
print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")
```
### Method 2: Using ONNX Runtime for Inference
```python
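import onnxruntime as ort
import numpy as np

# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)

# Get model input/output info. An exported model can expose several inputs and
# outputs, so print the names and confirm which output holds the action values.
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name
print([output.name for output in ort_session.get_outputs()])

def predict_action(observation):
    """
    Predict an action using the trained model.

    Assumes the selected output is a vector of per-action scores
    (discrete action space); verify this against your exported model.
    """
    # Prepare observation as a float32 batch of shape (1, obs_dim)
    obs_input = np.asarray(observation, dtype=np.float32).reshape(1, -1)
    # Run inference; run() returns a list with one array per requested output
    action_scores = ort_session.run([output_name], {input_name: obs_input})[0][0]
    # Take the deterministic (highest-scoring) action
    action = int(np.argmax(action_scores))
    # OR, if the scores are probabilities that sum to 1, sample stochastically:
    # action = np.random.choice(len(action_scores), p=action_scores)
    return action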
```
### Method 3: Running Trained Agent in Unity
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np

# Initialize environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Get behavior name (behavior specs are only populated after the first reset)
env.reset()
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]

# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    episode_reward = 0
    step_count = 0
    while len(decision_steps) > 0:
        # Get observations (one row per agent requesting a decision)
        observations = decision_steps.obs[0]
        # Predict actions using trained model.
        # Assumes a discrete action space and that the first model output
        # holds per-action scores; verify both against your exported model.
        actions = []
        for obs in observations:
            action_probs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
            action = np.argmax(action_probs[0])
            actions.append(action)
        # Send actions to environment, shaped (num_agents, num_branches)
        action_tuple = ActionTuple(discrete=np.array(actions, dtype=np.int32).reshape(-1, 1))
        env.set_actions(behavior_name, action_tuple)
        # Step environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        step_count += 1
        # Track rewards (first agent only)
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]
    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()
```
## Evaluation and Testing
### Performance Evaluation Script
```python
import numpy as np
from mlagents_envs.base_env import ActionTuple

def evaluate_model(env, model_session, num_episodes=100):
    """
    Evaluate the trained model performance.

    Assumes a discrete action space and that the first model output holds
    per-action scores. Rewards are tracked for the first agent only.
    """
    results = {
        'rewards': [],
        'episode_lengths': [],
    }
    # Behavior specs are only populated after the first reset
    env.reset()
    behavior_name = list(env.behavior_specs.keys())[0]
    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        episode_reward = 0
        episode_length = 0
        while len(decision_steps) > 0:
            # Get actions from model
            observations = decision_steps.obs[0]
            actions = []
            for obs in observations:
                action_probs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
                action = np.argmax(action_probs[0])  # Deterministic policy
                actions.append(action)
            # Step environment with actions shaped (num_agents, num_branches)
            action_tuple = ActionTuple(discrete=np.array(actions, dtype=np.int32).reshape(-1, 1))
            env.set_actions(behavior_name, action_tuple)
            env.step()
            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1
            # Accumulate reward and check for episode termination
            if len(terminal_steps) > 0:
                episode_reward += terminal_steps.reward[0]
                break
            if len(decision_steps) > 0:
                episode_reward += decision_steps.reward[0]
        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)

    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])
    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")
    return results
```
## Deployment Options
### Option 1: Unity Standalone Build
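1. Build your Unity environment with the trained model assigned to the agent (in ML-Agents this is done through the Model field of the Behavior Parameters component)
2. The build then runs inference from the ONNX file without a Python process
3. Deploy as a standalone executable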
### Option 2: Python Integration
```python
# For integration into larger Python applications
import numpy as np
import onnxruntime as ort

class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def act(self, observation):
        """Get deterministic action from observation"""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})
        return int(np.argmax(action_probs[0]))

    def act_stochastic(self, observation):
        """Get stochastic action; assumes the first output is a probability vector"""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        # Take the first output and drop the batch dimension
        action_probs = self.session.run(None, {self.input_name: obs_input})[0][0]
        return int(np.random.choice(len(action_probs), p=action_probs))

# Usage (current_observation is a placeholder for an observation vector from your environment)
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
action = agent.act(current_observation)
```
### Option 3: Web Deployment
```python
# For web applications using Flask/FastAPI
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    observation = np.array(data['observation'], dtype=np.float32)
    # Assumes the first model output holds per-action scores
    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})
    action = int(np.argmax(action_probs[0]))
    return jsonify({'action': action, 'confidence': float(np.max(action_probs[0]))})

if __name__ == '__main__':
    # debug=True is for local development only; disable it in production
    app.run(debug=True)
```
## Troubleshooting
### Common Issues
1. **ONNX Model Loading Errors**
   - Ensure ONNX runtime version compatibility
   - Check model file path and permissions
2. **Unity Environment Connection**
   - Verify Unity environment executable path
   - Check port availability (default: 5004 when connecting to the Unity Editor, 5005 as the base port for built executables)
3. **Observation Shape Mismatches**
   - Ensure observation preprocessing matches training
   - Check input normalization requirements
4. **Performance Issues**
   - Use deterministic policy for consistent results
   - Consider batch inference for multiple agents
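
For the port check mentioned under Unity Environment Connection, a minimal sketch using only the Python standard library. The helper name `is_port_free` is illustrative and not part of ML-Agents; it tests whether a TCP socket can bind to the port on the local machine at the moment of the call:

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
    return True


port = 5004  # default port named in the troubleshooting list
status = "free" if is_port_free(port) else "in use"
print(f"Port {port} is {status}")
```

If the port is in use, closing stale Unity or training processes usually releases it. Another option is to pass a different `base_port` or `worker_id` to `UnityEnvironment`.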
### Performance Optimization
```python
# Batch processing for multiple agents
import numpy as np

def batch_predict(model_session, observations):
    """Process multiple observations at once"""
    # Stack observations into a float32 batch of shape (num_agents, obs_dim)
    batch_obs = np.array(observations, dtype=np.float32)
    action_probs = model_session.run(None, {"obs_0": batch_obs})
    # Pick the highest-scoring action for each row
    actions = np.argmax(action_probs[0], axis=1)
    return actions
```
This guide covers loading, evaluating, and deploying the trained Huggy PPO agent, from simple testing to production deployment.