Update README.md

README.md CHANGED

@@ -10,26 +10,434 @@ tags:

# **ppo** Agent playing **Huggy**
This is a trained model of a **ppo** agent playing **Huggy**
using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

Removed:

````diff
-
-The Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/
-
-- A *short tutorial* where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
-- A *longer tutorial* to understand how ML-Agents works: https://huggingface.co/learn/deep-rl-course/unit5/introduction
-
-```bash
-mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
-```
-
````
# Huggy PPO Agent - Training Documentation

## Model Overview

**Huggy** is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit in a custom Unity environment, where it learns its target behaviors over 2 million training steps.

## Training Environment

- **Environment**: Unity ML-Agents custom environment "Huggy"
- **ML-Agents Version**: 1.2.0.dev0
- **ML-Agents Envs**: 1.2.0.dev0
- **Communicator API**: 1.5.0
- **PyTorch Version**: 2.7.1+cu126
- **Unity Package Version**: 2.2.1-exp.1
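
To confirm your local setup matches these versions, a quick check (assumes the packages are importable):

```python
import torch
import mlagents_envs

print(torch.__version__)          # expect 2.7.1+cu126
print(mlagents_envs.__version__)  # expect 1.2.0.dev0
```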

## Training Configuration

### PPO Hyperparameters
- **Batch Size**: 2,048
- **Buffer Size**: 20,480
- **Learning Rate**: 0.0003 (linear schedule)
- **Beta (entropy regularization)**: 0.005 (linear schedule)
- **Epsilon (PPO clip parameter)**: 0.2 (linear schedule)
- **Lambda (GAE parameter)**: 0.95
- **Number of Epochs**: 3
- **Shared Critic**: False

### Network Architecture
- **Normalization**: Enabled
- **Hidden Units**: 512
- **Number of Layers**: 3
- **Visual Encoding Type**: Simple
- **Memory**: None
- **Goal Conditioning Type**: Hyper
- **Deterministic**: False

### Reward Configuration
- **Reward Type**: Extrinsic
- **Gamma (discount factor)**: 0.995
- **Reward Strength**: 1.0
- **Reward Network Hidden Units**: 128
- **Reward Network Layers**: 2

### Training Parameters
- **Maximum Steps**: 2,000,000
- **Time Horizon**: 1,000
- **Summary Frequency**: 50,000 steps
- **Checkpoint Interval**: 200,000 steps
- **Keep Checkpoints**: 15
- **Threaded Training**: False
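
Taken together, these settings correspond to an `mlagents-learn` trainer configuration. A reconstruction of the `config.yaml` (field names follow the standard ML-Agents schema; the exact file is not reproduced here, so treat this as a sketch):

```yaml
behaviors:
  Huggy:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      learning_rate_schedule: linear
      beta: 0.005
      beta_schedule: linear
      epsilon: 0.2
      epsilon_schedule: linear
      lambd: 0.95
      num_epoch: 3
      shared_critic: false
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
      goal_conditioning_type: hyper
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
        network_settings:
          hidden_units: 128
          num_layers: 2
    max_steps: 2000000
    time_horizon: 1000
    summary_freq: 50000
    checkpoint_interval: 200000
    keep_checkpoints: 15
    threaded: false
```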

## Training Performance

### Performance Progression

The agent showed steady improvement throughout training:

**Early Training (0-200k steps):**
- Step 50k: Mean Reward = 1.840 ± 0.925
- Step 100k: Mean Reward = 2.747 ± 1.096
- Step 150k: Mean Reward = 3.031 ± 1.174
- Step 200k: Mean Reward = 3.538 ± 1.370

**Mid Training (200k-1M steps):**
- Performance stabilized around 3.6-3.9 mean reward
- Peak performance at 500k steps: 3.873 ± 1.783

**Late Training (1M-2M steps):**
- Consistent performance around 3.5-3.8 mean reward
- Final performance at 2M steps: 3.718 ± 2.132

### Key Performance Metrics

- **Training Duration**: 2,350.439 seconds (~39 minutes)
- **Final Mean Reward**: 3.718
- **Final Standard Deviation**: 2.132
- **Peak Mean Reward**: 3.873 (at 500k steps)
- **Lowest Standard Deviation**: 0.925 (at 50k steps)

## Training Characteristics

### Learning Curve Analysis
1. **Rapid Initial Learning**: Significant improvement in the first 200k steps (1.84 → 3.54)
2. **Plateau Phase**: Performance stabilized between 200k and 2M steps
3. **Variance Increase**: Standard deviation grew over time, suggesting the agent explored more diverse behavior patterns

### Model Checkpoints
Regular ONNX model exports were created every 200k steps:
- Huggy-199933.onnx
- Huggy-399938.onnx
- Huggy-599920.onnx
- Huggy-799966.onnx
- Huggy-999748.onnx
- Huggy-1199265.onnx
- Huggy-1399932.onnx
- Huggy-1599985.onnx
- Huggy-1799997.onnx
- Huggy-1999614.onnx
- **Final Model**: Huggy-2000364.onnx
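
Any intermediate checkpoint can be loaded for inference exactly like the final model; for example (the results path is an assumption based on the `Huggy2` run directory used later in this card):

```python
import onnxruntime as ort

# Compare an early checkpoint against the final model
early = ort.InferenceSession("results/Huggy2/Huggy/Huggy-199933.onnx")
final = ort.InferenceSession("results/Huggy2/Huggy/Huggy-2000364.onnx")
```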

## Technical Implementation

### Training Framework
- Unity ML-Agents with the PPO algorithm
- Custom Unity environment integration
- ONNX model export for deployment
- Real-time training monitoring

### Model Architecture Details
- Multi-layer perceptron with 3 hidden layers
- 512 hidden units per layer
- Input normalization enabled
- Separate actor-critic networks (shared_critic = False)
- Hypernetwork goal conditioning

### Reward Signal Processing
- Single extrinsic reward signal
- Discount factor of 0.995 for long-term planning
- Dedicated reward network with 2 layers and 128 units

## Performance Insights

### Strengths
- Consistent learning progression
- Stable final performance around 3.7 mean reward
- Successful completion of 2M training steps
- Regular checkpoint generation for model versioning

### Observations
- Standard deviation increased over training, suggesting the agent learned more diverse strategies
- The performance plateau after 200k steps suggests the policy converged well within the 2M-step budget
- The agent maintained stable performance without significant degradation

### Training Efficiency
- **Steps per Second**: ~851 steps/second average (2,000,000 steps / 2,350 seconds)
- **Episodes per Checkpoint**: Approximately 200-250 episodes per checkpoint
- **Memory Usage**: Efficient, with a 20,480-sample buffer and a time horizon of 1,000

This training session demonstrates a successful PPO implementation in a Unity environment, with consistent performance and robust learning characteristics.

# Huggy PPO Agent - Usage Guide

## Prerequisites

Before using the Huggy model, ensure you have the following installed:

```bash
# Install Unity ML-Agents
pip install mlagents==1.2.0

# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime
```

## Model Files

After training, you'll have these key files:
- **Huggy.onnx** - The trained model (final version)
- **Huggy-2000364.onnx** - Final checkpoint model
- **config.yaml** - Training configuration file
- **Training logs** - Performance metrics and TensorBoard data

## Loading and Using the Model

### Method 1: Using ML-Agents Python API

```python
from mlagents_envs.environment import UnityEnvironment

# Load the Unity environment (path to your built Huggy executable)
env = UnityEnvironment(file_name="path/to/your/huggy_environment")

# Reset the environment so behavior specs are populated
env.reset()

# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]  # "Huggy"
spec = env.behavior_specs[behavior_name]

print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")
```

### Method 2: Using ONNX Runtime for Inference

```python
import onnxruntime as ort
import numpy as np

# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)

# Get model input/output info (inspect get_outputs() to locate the action
# output; the first output is assumed to be the action tensor here)
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name

def predict_action(observation):
    """Predict an action for a single observation using the trained model."""
    # Prepare the observation as a batch of one (shape and dtype must match training)
    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

    # Run inference
    outputs = ort_session.run([output_name], {input_name: obs_input})
    action_values = outputs[0].flatten()

    # Deterministic (greedy) action
    action = np.argmax(action_values)
    # OR, if the output is a probability distribution, sample stochastically:
    # action = np.random.choice(len(action_values), p=action_values)

    return action
```

### Method 3: Running the Trained Agent in Unity

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np

# Initialize environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
env.reset()  # populate behavior specs before querying them
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Get behavior name
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]

# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    episode_reward = 0
    step_count = 0

    while len(decision_steps) > 0:
        # Get observations for all agents requesting a decision
        observations = decision_steps.obs[0]

        # Predict actions using the trained model
        actions = []
        for obs in observations:
            outputs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
            actions.append(np.argmax(outputs[0]))

        # Send actions to the environment (one row per agent, one discrete branch)
        action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))
        env.set_actions(behavior_name, action_tuple)

        # Step the environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        # Track rewards
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]

        step_count += 1

    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()
```

## Evaluation and Testing

### Performance Evaluation Script

```python
import numpy as np
from mlagents_envs.base_env import ActionTuple

def evaluate_model(env, model_session, num_episodes=100):
    """Evaluate the trained model's performance over several episodes."""
    results = {
        'rewards': [],
        'episode_lengths': [],
    }

    behavior_name = list(env.behavior_specs.keys())[0]

    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        episode_reward = 0
        episode_length = 0

        while len(decision_steps) > 0:
            # Get actions from the model
            observations = decision_steps.obs[0]
            actions = []

            for obs in observations:
                outputs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
                actions.append(np.argmax(outputs[0]))  # Deterministic policy

            # Step the environment
            action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))
            env.set_actions(behavior_name, action_tuple)
            env.step()

            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1

            # Accumulate rewards and check for episode termination
            if len(terminal_steps) > 0:
                episode_reward += terminal_steps.reward[0]
                break
            if len(decision_steps) > 0:
                episode_reward += decision_steps.reward[0]

        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)

    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])

    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")

    return results
```

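A minimal driver for this function, assuming the environment build and model path used above:

```python
import onnxruntime as ort
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="HuggyEnvironment")
env.reset()
session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

results = evaluate_model(env, session, num_episodes=20)
env.close()
```
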
## Deployment Options

### Option 1: Unity Standalone Build
1. Build your Unity environment with the trained model
2. Assign the ONNX file to the agent's Behavior Parameters so Unity's inference engine uses it automatically
3. Deploy as a standalone executable

### Option 2: Python Integration

```python
# For integration into larger Python applications
import onnxruntime as ort
import numpy as np

class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def act(self, observation):
        """Get the greedy action for an observation."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        outputs = self.session.run(None, {self.input_name: obs_input})
        return np.argmax(outputs[0])

    def act_stochastic(self, observation):
        """Sample an action, assuming the model outputs a probability distribution."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})[0].flatten()
        return np.random.choice(len(action_probs), p=action_probs)

# Usage
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
action = agent.act(current_observation)
```

### Option 3: Web Deployment
```python
# For web applications using Flask/FastAPI
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    observation = np.array(data['observation'], dtype=np.float32)

    # Run inference on a batch of one observation
    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})
    action = int(np.argmax(action_probs[0]))

    return jsonify({'action': action, 'confidence': float(np.max(action_probs[0]))})

if __name__ == '__main__':
    app.run(debug=True)
```

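A quick client-side check of this endpoint (using the `requests` package; the observation length must match your model's input size, which is an assumption here):

```python
import requests

# Hypothetical 10-dimensional observation; use your environment's actual size
payload = {"observation": [0.0] * 10}
response = requests.post("http://127.0.0.1:5000/predict", json=payload)
print(response.json())  # e.g. {'action': 2, 'confidence': 0.87}
```
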
## Troubleshooting

### Common Issues

1. **ONNX Model Loading Errors**
   - Ensure ONNX Runtime version compatibility
   - Check the model file path and permissions

2. **Unity Environment Connection**
   - Verify the Unity environment executable path
   - Check port availability (default: 5004); a non-default port can be set as shown below

3. **Observation Shape Mismatches**
   - Ensure observation preprocessing matches training
   - Check input normalization requirements

4. **Performance Issues**
   - Use a deterministic policy for consistent results
   - Consider batch inference for multiple agents
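
For connection issues, `UnityEnvironment` accepts `base_port` and `worker_id` parameters; for example:

```python
from mlagents_envs.environment import UnityEnvironment

# Use a non-default port if 5004 is already taken
env = UnityEnvironment(file_name="HuggyEnvironment", base_port=5010, worker_id=0)
```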

### Performance Optimization

```python
import numpy as np

# Batch processing for multiple agents
def batch_predict(model_session, observations):
    """Run inference on several observations in a single ONNX call."""
    batch_obs = np.array(observations, dtype=np.float32)
    action_probs = model_session.run(None, {"obs_0": batch_obs})
    actions = np.argmax(action_probs[0], axis=1)
    return actions
```

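This drops in for the per-observation loop in Method 3, since `decision_steps.obs[0]` is already a batch of observations:

```python
# Stack this step's observations and predict all actions at once
actions = batch_predict(ort_session, decision_steps.obs[0])
```
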
This guide provides comprehensive instructions for deploying and using your trained Huggy PPO agent in various scenarios, from simple testing to production deployment.