---
library_name: ml-agents
tags:
- Huggy
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Huggy
---

# **ppo** Agent playing **Huggy**

This is a trained model of a **ppo** agent playing **Huggy** using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

# Huggy PPO Agent - Training Documentation

## Model Overview

**Huggy** is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit. In this custom Unity environment, the agent learns to perform specific behaviors over 2 million training steps.

## Training Environment

- **Environment**: Unity ML-Agents custom environment "Huggy"
- **ML-Agents Version**: 1.2.0.dev0
- **ML-Agents Envs**: 1.2.0.dev0
- **Communicator API**: 1.5.0
- **PyTorch Version**: 2.7.1+cu126
- **Unity Package Version**: 2.2.1-exp.1

## Training Configuration

### PPO Hyperparameters
- **Batch Size**: 2,048
- **Buffer Size**: 20,480
- **Learning Rate**: 0.0003 (linear schedule)
- **Beta (entropy regularization)**: 0.005 (linear schedule)
- **Epsilon (PPO clip parameter)**: 0.2 (linear schedule)
- **Lambda (GAE parameter)**: 0.95
- **Number of Epochs**: 3
- **Shared Critic**: False

### Network Architecture
- **Normalization**: Enabled
- **Hidden Units**: 512
- **Number of Layers**: 3
- **Visual Encoding Type**: Simple
- **Memory**: None
- **Goal Conditioning Type**: Hyper
- **Deterministic**: False

### Reward Configuration
- **Reward Type**: Extrinsic
- **Gamma (discount factor)**: 0.995
- **Reward Strength**: 1.0
- **Reward Network Hidden Units**: 128
- **Reward Network Layers**: 2

### Training Parameters
- **Maximum Steps**: 2,000,000
- **Time Horizon**: 1,000
- **Summary Frequency**: 50,000 steps
- **Checkpoint Interval**: 200,000 steps
- **Keep Checkpoints**: 15
- **Threaded Training**: False

## Training Performance

### Performance Progression

The agent improved steadily throughout training:

**Early Training (0-200k steps):**
- Step 50k: Mean Reward = 1.840 ± 0.925
- Step 100k: Mean Reward = 2.747 ± 1.096
- Step 150k: Mean Reward = 3.031 ± 1.174
- Step 200k: Mean Reward = 3.538 ± 1.370

**Mid Training (200k-1M steps):**
- Performance stabilized around 3.6-3.9 mean reward
- Peak performance at 500k steps: 3.873 ± 1.783

**Late Training (1M-2M steps):**
- Consistent performance around 3.5-3.8 mean reward
- Final performance at 2M steps: 3.718 ± 2.132

### Key Performance Metrics
- **Training Duration**: 2,350.439 seconds (~39 minutes)
- **Final Mean Reward**: 3.718
- **Final Standard Deviation**: 2.132
- **Peak Mean Reward**: 3.873 (at 500k steps)
- **Lowest Standard Deviation**: 0.925 (at 50k steps)

## Training Characteristics

### Learning Curve Analysis

1. **Rapid Initial Learning**: Significant improvement in the first 200k steps (1.84 → 3.54)
2. **Plateau Phase**: Performance stabilized between 200k and 2M steps
3. **Variance Increase**: Standard deviation increased over time, indicating more diverse behavior patterns
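As a quick sanity check, the training-efficiency figures reported later in this document follow directly from the numbers above. The short snippet below uses only values already stated here (final checkpoint step count, training duration, and gamma):

```python
# Back-of-the-envelope checks derived from the metrics reported above.
total_steps = 2_000_364   # step count of the final exported checkpoint
duration_s = 2350.439     # reported training duration in seconds

print(f"Steps per second: {total_steps / duration_s:.0f}")   # ~851

# Effective horizon of the discount factor: rewards beyond roughly
# 1 / (1 - gamma) steps contribute little to the return, so gamma = 0.995
# targets ~200-step lookahead, well within the 1,000-step time horizon.
gamma = 0.995
print(f"Effective discount horizon: {1 / (1 - gamma):.0f} steps")
```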
### Model Checkpoints

Regular ONNX model exports were created every 200k steps:
- Huggy-199933.onnx
- Huggy-399938.onnx
- Huggy-599920.onnx
- Huggy-799966.onnx
- Huggy-999748.onnx
- Huggy-1199265.onnx
- Huggy-1399932.onnx
- Huggy-1599985.onnx
- Huggy-1799997.onnx
- Huggy-1999614.onnx
- **Final Model**: Huggy-2000364.onnx

## Technical Implementation

### Training Framework
- Unity ML-Agents with the PPO algorithm
- Custom Unity environment integration
- ONNX model export for deployment
- Real-time training monitoring

### Model Architecture Details
- Multi-layer perceptron with 3 hidden layers
- 512 hidden units per layer
- Input normalization enabled
- Separate actor-critic networks (shared_critic = False)
- Hypernetwork goal conditioning

### Reward Signal Processing
- Single extrinsic reward signal
- Discount factor of 0.995 for long-term planning
- Dedicated reward network with 2 layers and 128 units

## Performance Insights

### Strengths
- Consistent learning progression
- Stable final performance around 3.7 mean reward
- Successful completion of 2M training steps
- Regular checkpoint generation for model versioning

### Observations
- Standard deviation increased over training, suggesting the agent learned more diverse strategies
- The performance plateau after 200k steps indicates the task complexity was well matched to the training duration
- The agent maintained stable performance without significant degradation

### Training Efficiency
- **Steps per Second**: ~851 steps/second average
- **Episodes per Checkpoint**: Approximately 200-250 episodes per checkpoint
- **Memory Usage**: Efficient with a 20,480 buffer size and 1,000 time horizon

This training session demonstrates a successful PPO implementation in a Unity environment with consistent performance and robust learning characteristics.
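Before moving on to usage: since each checkpoint filename embeds its step count, the exports listed above can be enumerated in training order with a few lines of Python, which is handy for comparing intermediate policies. This is a minimal sketch assuming the default ML-Agents results layout (`results/<run-id>/<behavior-name>/`) with run ID `Huggy2`; adjust the path to your setup.

```python
import re
from pathlib import Path

# Assumed location of step-numbered checkpoints; adjust if your
# results directory or run ID differs.
checkpoint_dir = Path("results/Huggy2/Huggy")

# Sort checkpoints by the step count embedded in the filename.
checkpoints = sorted(
    checkpoint_dir.glob("Huggy-*.onnx"),
    key=lambda p: int(re.search(r"Huggy-(\d+)", p.stem).group(1)),
)
for ckpt in checkpoints:
    print(ckpt.name)
```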
# Huggy PPO Agent - Usage Guide

## Prerequisites

Before using the Huggy model, ensure you have the following installed:

```bash
# Install Unity ML-Agents
pip install mlagents==1.2.0

# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime
```

## Model Files

After training, you'll have these key files:
- **Huggy.onnx** - The trained model (final version)
- **Huggy-2000364.onnx** - Final checkpoint model
- **config.yaml** - Training configuration file
- **Training logs** - Performance metrics and TensorBoard data

## Loading and Using the Model

### Method 1: Using the ML-Agents Python API

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import numpy as np

# Load the Unity environment
env = UnityEnvironment(file_name="path/to/your/huggy_environment")

# Reset the environment
env.reset()

# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]  # "Huggy"
spec = env.behavior_specs[behavior_name]

print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")
```

### Method 2: Using ONNX Runtime for Inference

```python
import onnxruntime as ort
import numpy as np

# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)

# Get model input/output info
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name

def predict_action(observation):
    """Predict an action using the trained model."""
    # Prepare the observation: add a batch dimension, and make sure any
    # normalization applied during training is reproduced here
    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

    # Run inference
    action_probs = ort_session.run([output_name], {input_name: obs_input})

    action = np.argmax(action_probs[0])  # Deterministic
    # OR sample stochastically from the output distribution:
    # probs = action_probs[0].ravel()
    # action = np.random.choice(len(probs), p=probs)
    return action
```

### Method 3: Running the Trained Agent in Unity

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np

# Initialize the environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Get the behavior name
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]

# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    episode_reward = 0
    step_count = 0

    while len(decision_steps) > 0:
        # Get observations
        observations = decision_steps.obs[0]

        # Predict actions using the trained model
        actions = []
        for obs in observations:
            action_probs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
            action = np.argmax(action_probs[0])
            actions.append(action)

        # Send actions to the environment, shaped (n_agents, n_branches)
        action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))
        env.set_actions(behavior_name, action_tuple)

        # Step the environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        # Track rewards
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]

        step_count += 1

    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()
```
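Methods 2 and 3 assume a discrete action head (hence the `np.argmax` over output probabilities). If your build of the environment exposes continuous actions instead (`spec.action_spec.continuous_size > 0`), the model output is an action vector rather than a probability distribution, and the `ActionTuple` is built with the `continuous` argument. The helper below is an illustrative sketch, not part of the ML-Agents API; ONNX output naming varies between ML-Agents versions, so inspect `ort_session.get_outputs()` for your export:

```python
import numpy as np
from mlagents_envs.base_env import ActionTuple

def build_action_tuple(spec, raw_outputs):
    """Build an ActionTuple matching the behavior's action spec.

    `raw_outputs` is assumed to be the model output for a batch of agents:
    per-action probabilities for discrete specs, or an action vector for
    continuous specs.
    """
    if spec.action_spec.continuous_size > 0:
        # Continuous control: pass the action vector through as-is,
        # shaped (n_agents, continuous_size).
        continuous = np.asarray(raw_outputs, dtype=np.float32)
        return ActionTuple(continuous=continuous)
    # Discrete control: argmax per agent, one column per action branch.
    discrete = np.argmax(raw_outputs, axis=-1).reshape(-1, 1)
    return ActionTuple(discrete=discrete.astype(np.int32))
```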
## Evaluation and Testing

### Performance Evaluation Script

```python
import numpy as np
from mlagents_envs.base_env import ActionTuple

def evaluate_model(env, model_session, num_episodes=100):
    """Evaluate the trained model's performance."""
    results = {
        'rewards': [],
        'episode_lengths': [],
    }

    behavior_name = list(env.behavior_specs.keys())[0]

    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        episode_reward = 0
        episode_length = 0

        while len(decision_steps) > 0:
            # Get actions from the model
            observations = decision_steps.obs[0]
            actions = []
            for obs in observations:
                action_probs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
                action = np.argmax(action_probs[0])  # Deterministic policy
                actions.append(action)

            # Step the environment
            action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))
            env.set_actions(behavior_name, action_tuple)
            env.step()

            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1

            # Check for episode termination
            if len(terminal_steps) > 0:
                episode_reward = terminal_steps.reward[0]
                break

        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)

    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])

    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")

    return results
```

## Deployment Options

### Option 1: Unity Standalone Build
1. Build your Unity environment with the trained model
2. The model will automatically use the ONNX file for inference
3. Deploy as a standalone executable

### Option 2: Python Integration

```python
# For integration into larger Python applications
import onnxruntime as ort
import numpy as np

class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def act(self, observation):
        """Get a deterministic action from an observation."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})
        return np.argmax(action_probs[0])

    def act_stochastic(self, observation):
        """Sample a stochastic action from an observation."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})[0].ravel()
        return np.random.choice(len(action_probs), p=action_probs)

# Usage (current_observation is an observation vector from your environment)
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
action = agent.act(current_observation)
```

### Option 3: Web Deployment

```python
# For web applications using Flask/FastAPI
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    observation = np.array(data['observation'], dtype=np.float32)
    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})
    action = int(np.argmax(action_probs[0]))
    return jsonify({'action': action,
                    'confidence': float(np.max(action_probs[0]))})

if __name__ == '__main__':
    app.run(debug=True)
```
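To exercise the endpoint above, a client only needs to POST a JSON observation. Here is a minimal example using the `requests` library, assuming the server is running locally on Flask's default port 5000; the observation vector is a placeholder, so match its length to the observation size reported by `spec.observation_specs` for your environment:

```python
import requests

# Placeholder observation; replace with a real vector of the correct length.
observation = [0.0] * 10

response = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"observation": observation},
)
result = response.json()
print(f"Action: {result['action']}, confidence: {result['confidence']:.3f}")
```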
## Troubleshooting

### Common Issues

1. **ONNX Model Loading Errors**
   - Ensure ONNX Runtime version compatibility
   - Check the model file path and permissions

2. **Unity Environment Connection**
   - Verify the Unity environment executable path
   - Check port availability (default: 5004)

3. **Observation Shape Mismatches**
   - Ensure observation preprocessing matches training
   - Check input normalization requirements

4. **Performance Issues**
   - Use a deterministic policy for consistent results
   - Consider batch inference for multiple agents

### Performance Optimization

```python
import numpy as np

# Batch processing for multiple agents
def batch_predict(model_session, observations):
    """Process multiple observations at once."""
    batch_obs = np.array(observations, dtype=np.float32)
    action_probs = model_session.run(None, {"obs_0": batch_obs})
    actions = np.argmax(action_probs[0], axis=1)
    return actions
```

This guide provides comprehensive instructions for deploying and using your trained Huggy PPO agent in scenarios ranging from simple testing to production deployment.