---
library_name: ml-agents
tags:
- Huggy
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Huggy
---

# **ppo** Agent playing **Huggy**

This is a trained model of a **ppo** agent playing **Huggy** using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

# Huggy PPO Agent - Training Documentation

## Model Overview

**Huggy** is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit. In this custom Unity environment, the agent learns to perform specific behaviors over 2 million training steps.

## Training Environment

- **Environment**: Unity ML-Agents custom environment "Huggy"
- **ML-Agents Version**: 1.2.0.dev0
- **ML-Agents Envs**: 1.2.0.dev0
- **Communicator API**: 1.5.0
- **PyTorch Version**: 2.7.1+cu126
- **Unity Package Version**: 2.2.1-exp.1

## Training Configuration

### PPO Hyperparameters
- **Batch Size**: 2,048
- **Buffer Size**: 20,480
- **Learning Rate**: 0.0003 (linear schedule)
- **Beta (entropy regularization)**: 0.005 (linear schedule)
- **Epsilon (PPO clip parameter)**: 0.2 (linear schedule)
- **Lambda (GAE parameter)**: 0.95
- **Number of Epochs**: 3
- **Shared Critic**: False

### Network Architecture
- **Normalization**: Enabled
- **Hidden Units**: 512
- **Number of Layers**: 3
- **Visual Encoding Type**: Simple
- **Memory**: None
- **Goal Conditioning Type**: Hyper
- **Deterministic**: False

### Reward Configuration
- **Reward Type**: Extrinsic
- **Gamma (discount factor)**: 0.995
- **Reward Strength**: 1.0
- **Reward Network Hidden Units**: 128
- **Reward Network Layers**: 2

### Training Parameters
- **Maximum Steps**: 2,000,000
- **Time Horizon**: 1,000
- **Summary Frequency**: 50,000 steps
- **Checkpoint Interval**: 200,000 steps
- **Keep Checkpoints**: 15
- **Threaded Training**: False

## Training Performance

### Performance Progression

The agent improved steadily throughout training:

**Early Training (0-200k steps):**
- Step 50k: Mean Reward = 1.840 ± 0.925
- Step 100k: Mean Reward = 2.747 ± 1.096
- Step 150k: Mean Reward = 3.031 ± 1.174
- Step 200k: Mean Reward = 3.538 ± 1.370

**Mid Training (200k-1M steps):**
- Performance stabilized around 3.6-3.9 mean reward
- Peak performance at 500k steps: 3.873 ± 1.783

**Late Training (1M-2M steps):**
- Consistent performance around 3.5-3.8 mean reward
- Final performance at 2M steps: 3.718 ± 2.132

### Key Performance Metrics
- **Training Duration**: 2,350.439 seconds (~39 minutes)
- **Final Mean Reward**: 3.718
- **Final Standard Deviation**: 2.132
- **Peak Mean Reward**: 3.873 (at 500k steps)
- **Lowest Standard Deviation**: 0.925 (at 50k steps)

## Training Characteristics

### Learning Curve Analysis

1. **Rapid Initial Learning**: Significant improvement in the first 200k steps (1.84 → 3.54)
2. **Plateau Phase**: Performance stabilized between 200k and 2M steps
3. **Variance Increase**: Standard deviation increased over time, indicating more diverse behavior patterns
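As a quick sanity check, the training-efficiency figures reported later in this document follow directly from the numbers above. The short snippet below uses only values already stated here (final checkpoint step count, training duration, and gamma):

```python
# Back-of-the-envelope checks derived from the metrics reported above.
total_steps = 2_000_364   # step count of the final exported checkpoint
duration_s = 2350.439     # reported training duration in seconds

print(f"Steps per second: {total_steps / duration_s:.0f}")   # ~851

# Effective horizon of the discount factor: rewards beyond roughly
# 1 / (1 - gamma) steps contribute little to the return, so gamma = 0.995
# targets ~200-step lookahead, well within the 1,000-step time horizon.
gamma = 0.995
print(f"Effective discount horizon: {1 / (1 - gamma):.0f} steps")
```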
### Model Checkpoints

Regular ONNX model exports were created every 200k steps:
- Huggy-199933.onnx
- Huggy-399938.onnx
- Huggy-599920.onnx
- Huggy-799966.onnx
- Huggy-999748.onnx
- Huggy-1199265.onnx
- Huggy-1399932.onnx
- Huggy-1599985.onnx
- Huggy-1799997.onnx
- Huggy-1999614.onnx
- **Final Model**: Huggy-2000364.onnx

## Technical Implementation

### Training Framework
- Unity ML-Agents with the PPO algorithm
- Custom Unity environment integration
- ONNX model export for deployment
- Real-time training monitoring

### Model Architecture Details
- Multi-layer perceptron with 3 hidden layers
- 512 hidden units per layer
- Input normalization enabled
- Separate actor-critic networks (shared_critic = False)
- Hypernetwork goal conditioning

### Reward Signal Processing
- Single extrinsic reward signal
- Discount factor of 0.995 for long-term planning
- Dedicated reward network with 2 layers and 128 units

## Performance Insights

### Strengths
- Consistent learning progression
- Stable final performance around 3.7 mean reward
- Successful completion of 2M training steps
- Regular checkpoint generation for model versioning

### Observations
- Standard deviation increased over training, suggesting the agent learned more diverse strategies
- The performance plateau after 200k steps indicates the task complexity was well matched to the training duration
- The agent maintained stable performance without significant degradation

### Training Efficiency
- **Steps per Second**: ~851 steps/second average
- **Episodes per Checkpoint**: Approximately 200-250 episodes per checkpoint
- **Memory Usage**: Efficient with a 20,480 buffer size and 1,000 time horizon

This training session demonstrates a successful PPO implementation in a Unity environment with consistent performance and robust learning characteristics.
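Before moving on to usage: since each checkpoint filename embeds its step count, the exports listed above can be enumerated in training order with a few lines of Python, which is handy for comparing intermediate policies. This is a minimal sketch assuming the default ML-Agents results layout (`results/<run-id>/<behavior-name>/`) with run ID `Huggy2`; adjust the path to your setup.

```python
import re
from pathlib import Path

# Assumed location of step-numbered checkpoints; adjust if your
# results directory or run ID differs.
checkpoint_dir = Path("results/Huggy2/Huggy")

# Sort checkpoints by the step count embedded in the filename.
checkpoints = sorted(
    checkpoint_dir.glob("Huggy-*.onnx"),
    key=lambda p: int(re.search(r"Huggy-(\d+)", p.stem).group(1)),
)
for ckpt in checkpoints:
    print(ckpt.name)
```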
# Huggy PPO Agent - Usage Guide

## Prerequisites

Before using the Huggy model, ensure you have the following installed:

```bash
# Install Unity ML-Agents
pip install mlagents==1.2.0

# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime
```

## Model Files

After training, you'll have these key files:
- **Huggy.onnx** - The trained model (final version)
- **Huggy-2000364.onnx** - Final checkpoint model
- **config.yaml** - Training configuration file
- **Training logs** - Performance metrics and TensorBoard data

## Loading and Using the Model

### Method 1: Using the ML-Agents Python API

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import numpy as np

# Load the Unity environment
env = UnityEnvironment(file_name="path/to/your/huggy_environment")

# Reset the environment
env.reset()

# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]  # "Huggy"
spec = env.behavior_specs[behavior_name]

print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")
```

### Method 2: Using ONNX Runtime for Inference

```python
import onnxruntime as ort
import numpy as np

# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)

# Get model input/output info
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name

def predict_action(observation):
    """Predict an action using the trained model."""
    # Prepare the observation: add a batch dimension, and make sure any
    # normalization applied during training is reproduced here
    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

    # Run inference
    action_probs = ort_session.run([output_name], {input_name: obs_input})

    action = np.argmax(action_probs[0])  # Deterministic
    # OR sample stochastically from the output distribution:
    # probs = action_probs[0].ravel()
    # action = np.random.choice(len(probs), p=probs)
    return action
```

### Method 3: Running the Trained Agent in Unity

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np

# Initialize the environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Get the behavior name
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]

# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    episode_reward = 0
    step_count = 0

    while len(decision_steps) > 0:
        # Get observations
        observations = decision_steps.obs[0]

        # Predict actions using the trained model
        actions = []
        for obs in observations:
            action_probs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
            action = np.argmax(action_probs[0])
            actions.append(action)

        # Send actions to the environment, shaped (n_agents, n_branches)
        action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))
        env.set_actions(behavior_name, action_tuple)

        # Step the environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        # Track rewards
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]

        step_count += 1

    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()
```
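Methods 2 and 3 assume a discrete action head (hence the `np.argmax` over output probabilities). If your build of the environment exposes continuous actions instead (`spec.action_spec.continuous_size > 0`), the model output is an action vector rather than a probability distribution, and the `ActionTuple` is built with the `continuous` argument. The helper below is an illustrative sketch, not part of the ML-Agents API; ONNX output naming varies between ML-Agents versions, so inspect `ort_session.get_outputs()` for your export:

```python
import numpy as np
from mlagents_envs.base_env import ActionTuple

def build_action_tuple(spec, raw_outputs):
    """Build an ActionTuple matching the behavior's action spec.

    `raw_outputs` is assumed to be the model output for a batch of agents:
    per-action probabilities for discrete specs, or an action vector for
    continuous specs.
    """
    if spec.action_spec.continuous_size > 0:
        # Continuous control: pass the action vector through as-is,
        # shaped (n_agents, continuous_size).
        continuous = np.asarray(raw_outputs, dtype=np.float32)
        return ActionTuple(continuous=continuous)
    # Discrete control: argmax per agent, one column per action branch.
    discrete = np.argmax(raw_outputs, axis=-1).reshape(-1, 1)
    return ActionTuple(discrete=discrete.astype(np.int32))
```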
## Evaluation and Testing

### Performance Evaluation Script

```python
import numpy as np
from mlagents_envs.base_env import ActionTuple

def evaluate_model(env, model_session, num_episodes=100):
    """Evaluate the trained model's performance."""
    results = {
        'rewards': [],
        'episode_lengths': [],
    }

    behavior_name = list(env.behavior_specs.keys())[0]

    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        episode_reward = 0
        episode_length = 0

        while len(decision_steps) > 0:
            # Get actions from the model
            observations = decision_steps.obs[0]
            actions = []
            for obs in observations:
                action_probs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
                action = np.argmax(action_probs[0])  # Deterministic policy
                actions.append(action)

            # Step the environment
            action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))
            env.set_actions(behavior_name, action_tuple)
            env.step()

            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1

            # Check for episode termination
            if len(terminal_steps) > 0:
                episode_reward = terminal_steps.reward[0]
                break

        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)

    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])

    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")

    return results
```

## Deployment Options

### Option 1: Unity Standalone Build
1. Build your Unity environment with the trained model
2. The model will automatically use the ONNX file for inference
3. Deploy as a standalone executable

### Option 2: Python Integration

```python
# For integration into larger Python applications
import onnxruntime as ort
import numpy as np

class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def act(self, observation):
        """Get a deterministic action from an observation."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})
        return np.argmax(action_probs[0])

    def act_stochastic(self, observation):
        """Sample a stochastic action from an observation."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})[0].ravel()
        return np.random.choice(len(action_probs), p=action_probs)

# Usage (current_observation is an observation vector from your environment)
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
action = agent.act(current_observation)
```

### Option 3: Web Deployment

```python
# For web applications using Flask/FastAPI
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    observation = np.array(data['observation'], dtype=np.float32)
    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})
    action = int(np.argmax(action_probs[0]))
    return jsonify({'action': action,
                    'confidence': float(np.max(action_probs[0]))})

if __name__ == '__main__':
    app.run(debug=True)
```
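To exercise the endpoint above, a client only needs to POST a JSON observation. Here is a minimal example using the `requests` library, assuming the server is running locally on Flask's default port 5000; the observation vector is a placeholder, so match its length to the observation size reported by `spec.observation_specs` for your environment:

```python
import requests

# Placeholder observation; replace with a real vector of the correct length.
observation = [0.0] * 10

response = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"observation": observation},
)
result = response.json()
print(f"Action: {result['action']}, confidence: {result['confidence']:.3f}")
```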
## Troubleshooting

### Common Issues

1. **ONNX Model Loading Errors**
   - Ensure ONNX Runtime version compatibility
   - Check the model file path and permissions

2. **Unity Environment Connection**
   - Verify the Unity environment executable path
   - Check port availability (default: 5004)

3. **Observation Shape Mismatches**
   - Ensure observation preprocessing matches training
   - Check input normalization requirements

4. **Performance Issues**
   - Use a deterministic policy for consistent results
   - Consider batch inference for multiple agents

### Performance Optimization

```python
import numpy as np

# Batch processing for multiple agents
def batch_predict(model_session, observations):
    """Process multiple observations at once."""
    batch_obs = np.array(observations, dtype=np.float32)
    action_probs = model_session.run(None, {"obs_0": batch_obs})
    actions = np.argmax(action_probs[0], axis=1)
    return actions
```

This guide provides comprehensive instructions for deploying and using your trained Huggy PPO agent in scenarios ranging from simple testing to production deployment.