	---
	library_name: ml-agents
	tags:
	- Huggy
	- deep-reinforcement-learning
	- reinforcement-learning
	- ML-Agents-Huggy
	---

	# PPO Agent playing Huggy
	This is a trained model of a PPO agent playing Huggy,
	trained with the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

	# Huggy PPO Agent - Training Documentation

	## Model Overview

	This model is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit in Huggy, a custom Unity environment. The agent was trained for 2 million steps.

	## Training Environment

	- Environment: Unity ML-Agents custom environment "Huggy"
	- ML-Agents Version: 1.2.0.dev0
	- ML-Agents Envs: 1.2.0.dev0
	- Communicator API: 1.5.0
	- PyTorch Version: 2.7.1+cu126
	- Unity Package Version: 2.2.1-exp.1

	## Training Configuration

	### PPO Hyperparameters
	- Batch Size: 2,048
	- Buffer Size: 20,480
	- Learning Rate: 0.0003 (linear schedule)
	- Beta (entropy regularization): 0.005 (linear schedule)
	- Epsilon (PPO clip parameter): 0.2 (linear schedule)
	- Lambda (GAE parameter): 0.95
	- Number of Epochs: 3
	- Shared Critic: False
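
To make these numbers concrete, the sketch below works out how many minibatch updates one filled buffer yields and how a linearly scheduled value shrinks over training. The `linear_schedule` helper and its `end_value` floor are illustrative assumptions, not the ML-Agents implementation; the library may clamp each scheduled value at its own minimum.

```python
# Values taken from the hyperparameter list above.
BATCH_SIZE = 2_048
BUFFER_SIZE = 20_480
NUM_EPOCH = 3
MAX_STEPS = 2_000_000

# Each filled buffer is split into minibatches and replayed for NUM_EPOCH epochs.
minibatches_per_epoch = BUFFER_SIZE // BATCH_SIZE
updates_per_buffer = minibatches_per_epoch * NUM_EPOCH


def linear_schedule(initial, step, max_steps=MAX_STEPS, end_value=0.0):
    """Linearly interpolate from `initial` at step 0 to `end_value` at max_steps.

    Hypothetical helper: `end_value` is an assumed floor, for illustration only.
    """
    fraction_left = max(0.0, 1.0 - step / max_steps)
    return end_value + (initial - end_value) * fraction_left


print(minibatches_per_epoch, updates_per_buffer)
print(linear_schedule(0.0003, 1_000_000))  # learning rate halfway through training
```

Under these assumptions, each buffer is consumed in 10 minibatches per epoch, or 30 gradient updates in total, before new experience is collected.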

	### Network Architecture
	- Normalization: Enabled
	- Hidden Units: 512
	- Number of Layers: 3
	- Visual Encoding Type: Simple
	- Memory: None
	- Goal Conditioning Type: Hyper
	- Deterministic: False

	### Reward Configuration
	- Reward Type: Extrinsic
	- Gamma (discount factor): 0.995
	- Reward Strength: 1.0
	- Reward Network Hidden Units: 128
	- Reward Network Layers: 2
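
As a rough guide to what a discount factor of 0.995 means, the sketch below computes the common `1 / (1 - gamma)` rule of thumb for the effective planning horizon and a discounted return. Both helpers are illustrative and not part of ML-Agents.

```python
GAMMA = 0.995  # discount factor from the list above


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t, accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(round(effective_horizon(GAMMA)))
print(discounted_return([1.0, 1.0, 1.0]))
```

With gamma = 0.995 the rule of thumb gives a horizon of about 200 steps, well below the configured time horizon of 1,000.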

	### Training Parameters
	- Maximum Steps: 2,000,000
	- Time Horizon: 1,000
	- Summary Frequency: 50,000 steps
	- Checkpoint Interval: 200,000 steps
	- Keep Checkpoints: 15
	- Threaded Training: False

	## Training Performance

	### Performance Progression

	The agent improved quickly during the first 200k steps and then held a roughly constant mean reward:

	Early Training (0-200k steps):
	- Step 50k: Mean Reward = 1.840 ± 0.925
	- Step 100k: Mean Reward = 2.747 ± 1.096
	- Step 150k: Mean Reward = 3.031 ± 1.174
	- Step 200k: Mean Reward = 3.538 ± 1.370

	Mid Training (200k-1M steps):
	- Performance stabilized around 3.6-3.9 mean reward
	- Peak performance at 500k steps: 3.873 ± 1.783

	Late Training (1M-2M steps):
	- Consistent performance around 3.5-3.8 mean reward
	- Final performance at 2M steps: 3.718 ± 2.132

	### Key Performance Metrics

	- Training Duration: 2,350.439 seconds (~39 minutes)
	- Final Mean Reward: 3.718
	- Final Standard Deviation: 2.132
	- Peak Mean Reward: 3.873 (at 500k steps)
	- Lowest Standard Deviation: 0.925 (at 50k steps)

	## Training Characteristics

	### Learning Curve Analysis
	1. Rapid Initial Learning: Mean reward nearly doubled in the first 200k steps (1.84 → 3.54)
	2. Plateau Phase: Mean reward stayed between roughly 3.5 and 3.9 from 200k to 2M steps
	3. Variance Increase: The reported standard deviation rose from 0.925 at 50k steps to 2.132 at 2M steps, which may reflect more varied episode outcomes
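
One way to read the variance trend is to compare the standard deviation with the mean at each reported point. The sketch below does this for the figures listed in the progression above; `relative_spread` is a helper introduced here for illustration.

```python
# (step, mean reward, standard deviation) as reported in the progression above.
REPORTED = [
    (50_000, 1.840, 0.925),
    (200_000, 3.538, 1.370),
    (500_000, 3.873, 1.783),
    (2_000_000, 3.718, 2.132),
]


def relative_spread(mean, std):
    """Coefficient of variation: standard deviation as a fraction of the mean."""
    return std / mean


for step, mean, std in REPORTED:
    print(f"{step:>9,} steps: std/mean = {relative_spread(mean, std):.2f}")
```

Relative to the mean, the spread is lowest around 200k steps and highest at the end of training.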

	### Model Checkpoints
	Regular ONNX model exports were created every 200k steps:
	- Huggy-199933.onnx
	- Huggy-399938.onnx
	- Huggy-599920.onnx
	- Huggy-799966.onnx
	- Huggy-999748.onnx
	- Huggy-1199265.onnx
	- Huggy-1399932.onnx
	- Huggy-1599985.onnx
	- Huggy-1799997.onnx
	- Huggy-1999614.onnx
	- Final Model: Huggy-2000364.onnx
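
The step count is encoded in each file name, so the latest checkpoint can be picked programmatically. This is a small sketch assuming the `Huggy-<steps>.onnx` naming shown above; `checkpoint_step` is a helper introduced here for illustration.

```python
import re

# File names as listed above.
REGULAR = [
    "Huggy-199933.onnx", "Huggy-399938.onnx", "Huggy-599920.onnx",
    "Huggy-799966.onnx", "Huggy-999748.onnx", "Huggy-1199265.onnx",
    "Huggy-1399932.onnx", "Huggy-1599985.onnx", "Huggy-1799997.onnx",
    "Huggy-1999614.onnx",
]
FINAL = "Huggy-2000364.onnx"


def checkpoint_step(filename):
    """Extract the step count from a name like 'Huggy-199933.onnx'."""
    match = re.fullmatch(r"Huggy-(\d+)\.onnx", filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {filename}")
    return int(match.group(1))


steps = [checkpoint_step(name) for name in REGULAR]
gaps = [later - earlier for earlier, later in zip(steps, steps[1:])]
latest = max(REGULAR + [FINAL], key=checkpoint_step)

print(latest)
print(min(gaps), max(gaps))  # spacing between consecutive regular checkpoints
```

Checkpoints land slightly off the exact 200k marks, so matching on the parsed step count is more reliable than hard-coding file names.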

	## Technical Implementation

	### Training Framework
	- Unity ML-Agents with PPO algorithm
	- Custom Unity environment integration
	- ONNX model export for deployment
	- Real-time training monitoring

	### Model Architecture Details
	- Multi-layer perceptron with 3 hidden layers
	- 512 hidden units per layer
	- Input normalization enabled
	- Separate actor-critic networks (shared_critic = False)
	- Hypernetwork goal conditioning
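
For a sense of scale, the sketch below counts the weights and biases in a fully connected body with 3 hidden layers of 512 units. `OBS_SIZE` is a hypothetical observation size chosen only for illustration; the count ignores action and value heads, normalization statistics, and any goal-conditioning layers.

```python
OBS_SIZE = 64  # hypothetical observation vector length, not taken from the model


def mlp_body_params(input_dim, hidden_units=512, num_layers=3):
    """Count weights and biases of `num_layers` dense layers of equal width."""
    total = 0
    fan_in = input_dim
    for _ in range(num_layers):
        total += fan_in * hidden_units + hidden_units  # weight matrix plus bias
        fan_in = hidden_units
    return total


print(mlp_body_params(OBS_SIZE))
```

With separate actor and critic networks, a body of this size would exist once for each of the two.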

	### Reward Signal Processing
	- Single extrinsic reward signal
	- Discount factor of 0.995 for long-term planning
	- Dedicated reward network with 2 layers and 128 units

	## Performance Insights

	### Strengths
	- Fast early learning, with most of the improvement in the first 200k steps
	- Stable final performance around 3.7 mean reward
	- Successful completion of 2M training steps
	- Regular checkpoint generation for model versioning

	### Observations
	- Standard deviation increased over training (0.925 → 2.132), so episode outcomes became more variable; this may indicate more diverse strategies, but the logs alone do not confirm it
	- The plateau after 200k steps suggests that most of the learning happened early and that the remaining steps added little to the mean reward
	- Mean reward did not degrade noticeably during the plateau

	### Training Efficiency
	- Steps per Second: ~851 steps/second average
	- Episodes per Checkpoint: Approximately 200-250 episodes per checkpoint
	- Experience Buffer: 20,480 steps collected per policy update, with a time horizon of 1,000 steps
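
The throughput figure follows directly from the reported step count and duration. A short check of that arithmetic, using only numbers stated in this document:

```python
TOTAL_STEPS = 2_000_000
DURATION_S = 2_350.439
BUFFER_SIZE = 20_480

steps_per_second = TOTAL_STEPS / DURATION_S
duration_min = DURATION_S / 60
buffer_fills = TOTAL_STEPS / BUFFER_SIZE  # approximate number of policy-update rounds

print(f"{steps_per_second:.0f} steps/s over {duration_min:.1f} min")
print(f"about {buffer_fills:.0f} buffer fills")
```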

	This training run shows PPO reaching a stable mean reward of about 3.7 in the Huggy environment within 2M steps.

	# Huggy PPO Agent - Usage Guide

	## Prerequisites

	Before using the Huggy model, ensure you have the following installed:

	```bash
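	# Install Unity ML-Agents.
	# Training used 1.2.0.dev0, a development build; if this release is not
	# available from your package index, install from the ML-Agents GitHub repository.
	pip install mlagents==1.2.0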

	# Install required dependencies
	pip install torch==2.7.1
	pip install onnx
	pip install onnxruntime
	```
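
After installing, it can help to confirm which versions are actually present. The sketch below uses only the standard library; the package list mirrors the install commands above, and `installed_version` is a helper introduced here for illustration.

```python
from importlib import metadata

# Distribution names as they appear in the install commands above.
REQUIRED = ["mlagents", "torch", "onnx", "onnxruntime"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in REQUIRED:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```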

	## Model Files

	After training, you'll have these key files:
	- Huggy.onnx - The trained model (final version)
	- Huggy-2000364.onnx - Final checkpoint model
	- config.yaml - Training configuration file
	- training logs - Performance metrics and tensorboard data

	## Loading and Using the Model

	### Method 1: Using ML-Agents Python API

	```python
	from mlagents_envs.environment import UnityEnvironment

	# Load the Unity environment
	env = UnityEnvironment(file_name="path/to/your/huggy_environment")

	# Reset the environment (behavior specs are only available after the first reset)
	env.reset()

	# Get behavior specs
	behavior_names = list(env.behavior_specs.keys())
	behavior_name = behavior_names[0]  # e.g. "Huggy?team=0"
	spec = env.behavior_specs[behavior_name]

	print(f"Observation space: {spec.observation_specs}")
	print(f"Action space: {spec.action_spec}")

	# Release the environment's port when done
	env.close()
	```

	### Method 2: Using ONNX Runtime for Inference

	```python
	import onnxruntime as ort
	import numpy as np

	# Load the trained ONNX model
	model_path = "results/Huggy2/Huggy.onnx"
	ort_session = ort.InferenceSession(model_path)

	# Get model input/output info.
	# Assumption: the first input is the observation and the first output holds
	# per-action scores. Print get_inputs()/get_outputs() to confirm for your export.
	input_name = ort_session.get_inputs()[0].name
	output_name = ort_session.get_outputs()[0].name

	def predict_action(observation):
	    """
	    Predict an action for a single observation using the trained model
	    """
	    # Prepare observation: float32 with a leading batch dimension
	    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

	    # Run inference; run() returns a list with one array per requested output
	    action_probs = ort_session.run([output_name], {input_name: obs_input})[0][0]

	    # Sample action from probabilities or take deterministic action
	    action = int(np.argmax(action_probs))  # Deterministic
	    # OR: action = np.random.choice(len(action_probs), p=action_probs)  # Stochastic

	    return action
	```
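
The example above assumes which input and output to use. Before relying on that, it may be worth listing what the exported model actually exposes. The helpers below are a sketch that works with any object offering `get_inputs()` and `get_outputs()` whose items carry `.name` and `.shape`, as an ONNX Runtime session does; `describe_io` and `find_output` are names introduced here for illustration.

```python
def describe_io(session):
    """Return input and output (name, shape) pairs for an inference session."""
    return {
        "inputs": [(node.name, node.shape) for node in session.get_inputs()],
        "outputs": [(node.name, node.shape) for node in session.get_outputs()],
    }


def find_output(session, keyword):
    """Return the first output name containing `keyword`, or None if absent."""
    for node in session.get_outputs():
        if keyword in node.name:
            return node.name
    return None
```

With a real session, `describe_io(ort_session)` shows the names to pass to `run()`, and `find_output(ort_session, "action")` is one way to avoid depending on output order.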

	### Method 3: Running Trained Agent in Unity

	```python
	from mlagents_envs.environment import UnityEnvironment
	from mlagents_envs.base_env import ActionTuple
	import onnxruntime as ort
	import numpy as np

	# Initialize environment and model
	env = UnityEnvironment(file_name="HuggyEnvironment")
	ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

	# Get behavior name (behavior specs are only populated after the first reset)
	env.reset()
	behavior_names = list(env.behavior_specs.keys())
	behavior_name = behavior_names[0]

	# Run episodes
	for episode in range(10):
	    env.reset()
	    decision_steps, terminal_steps = env.get_steps(behavior_name)

	    episode_reward = 0
	    step_count = 0

	    while len(decision_steps) > 0:
	        # Get observations (one row per agent requesting a decision)
	        observations = decision_steps.obs[0]

	        # Predict actions using trained model.
	        # Assumption: input "obs_0" and a first output of per-action scores;
	        # check ort_session.get_inputs()/get_outputs() for your export.
	        actions = []
	        for obs in observations:
	            action_probs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
	            action = np.argmax(action_probs[0])
	            actions.append(action)

	        # Send actions to environment; discrete actions have shape (n_agents, n_branches)
	        action_tuple = ActionTuple(discrete=np.array(actions, dtype=np.int32).reshape(-1, 1))
	        env.set_actions(behavior_name, action_tuple)

	        # Step environment
	        env.step()
	        decision_steps, terminal_steps = env.get_steps(behavior_name)
	        step_count += 1

	        # Track rewards of the first agent only
	        if len(terminal_steps) > 0:
	            episode_reward += terminal_steps.reward[0]
	            break
	        if len(decision_steps) > 0:
	            episode_reward += decision_steps.reward[0]

	    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

	env.close()
	```

	## Evaluation and Testing

	### Performance Evaluation Script

	```python
	import numpy as np
	from mlagents_envs.base_env import ActionTuple

	def evaluate_model(env, model_session, num_episodes=100):
	    """
	    Evaluate the trained model performance
	    """
	    results = {
	        'rewards': [],
	        'episode_lengths': [],
	    }

	    # Behavior specs are only populated after the first reset
	    env.reset()
	    behavior_name = list(env.behavior_specs.keys())[0]

	    for episode in range(num_episodes):
	        env.reset()
	        decision_steps, terminal_steps = env.get_steps(behavior_name)

	        episode_reward = 0
	        episode_length = 0

	        while len(decision_steps) > 0:
	            # Get actions from model (assumes input "obs_0" and a first
	            # output of per-action scores; verify against your export)
	            observations = decision_steps.obs[0]
	            actions = []

	            for obs in observations:
	                action_probs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
	                action = np.argmax(action_probs[0])  # Deterministic policy
	                actions.append(action)

	            # Step environment; discrete actions have shape (n_agents, n_branches)
	            action_tuple = ActionTuple(discrete=np.array(actions, dtype=np.int32).reshape(-1, 1))
	            env.set_actions(behavior_name, action_tuple)
	            env.step()

	            decision_steps, terminal_steps = env.get_steps(behavior_name)
	            episode_length += 1

	            # Accumulate the first agent's reward and check for termination
	            if len(terminal_steps) > 0:
	                episode_reward += terminal_steps.reward[0]
	                break
	            if len(decision_steps) > 0:
	                episode_reward += decision_steps.reward[0]

	        results['rewards'].append(episode_reward)
	        results['episode_lengths'].append(episode_length)

	    # Calculate statistics
	    mean_reward = np.mean(results['rewards'])
	    std_reward = np.std(results['rewards'])
	    mean_length = np.mean(results['episode_lengths'])

	    print(f"Evaluation Results ({num_episodes} episodes):")
	    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
	    print(f"Mean Episode Length: {mean_length:.1f}")
	    print(f"Min Reward: {np.min(results['rewards']):.3f}")
	    print(f"Max Reward: {np.max(results['rewards']):.3f}")

	    return results
	```
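
When comparing evaluation runs, the standard error of the mean indicates how much the average would be expected to vary between runs of the same size. The sketch below summarizes a list of episode rewards with the standard library; `summarize_rewards` is a helper introduced here, and it uses the sample standard deviation, which differs slightly from `np.std`'s default population form.

```python
import math
import statistics


def summarize_rewards(rewards):
    """Mean, sample standard deviation, and standard error of episode rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return {
        "mean": mean,
        "std": std,
        "stderr": std / math.sqrt(len(rewards)),
        "n": len(rewards),
    }


# Hypothetical rewards, for illustration only (not measured results).
example = summarize_rewards([3.0, 4.0, 5.0])
print(example)
```

It could be applied to the `results['rewards']` list returned by the evaluation function.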

	## Deployment Options

	### Option 1: Unity Standalone Build
	1. Build your Unity environment with the trained model
	2. The model will automatically use the ONNX file for inference
	3. Deploy as a standalone executable

	### Option 2: Python Integration
	```python
	# For integration into larger Python applications
	import numpy as np
	import onnxruntime as ort

	class HuggyAgent:
	    def __init__(self, model_path):
	        self.session = ort.InferenceSession(model_path)
	        self.input_name = self.session.get_inputs()[0].name

	    def act(self, observation):
	        """Get action from observation"""
	        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
	        action_probs = self.session.run(None, {self.input_name: obs_input})
	        return int(np.argmax(action_probs[0]))

	    def act_stochastic(self, observation):
	        """Get stochastic action from observation"""
	        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
	        # First output, first batch row; assumed to be a probability vector
	        action_probs = self.session.run(None, {self.input_name: obs_input})[0][0]
	        return int(np.random.choice(len(action_probs), p=action_probs))

	# Usage (current_observation is a placeholder for one observation vector)
	agent = HuggyAgent("results/Huggy2/Huggy.onnx")
	action = agent.act(current_observation)
	```

	### Option 3: Web Deployment
	```python
	# For web applications using Flask
	from flask import Flask, request, jsonify
	import onnxruntime as ort
	import numpy as np

	app = Flask(__name__)
	model = ort.InferenceSession("Huggy.onnx")

	@app.route('/predict', methods=['POST'])
	def predict():
	    data = request.json
	    observation = np.array(data['observation'], dtype=np.float32)

	    # Assumes input "obs_0" and a first output of per-action scores
	    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})
	    action = int(np.argmax(action_probs[0]))

	    return jsonify({'action': action, 'confidence': float(np.max(action_probs[0]))})

	if __name__ == '__main__':
	    # Development server only; keep debug mode off anywhere reachable by others
	    app.run(debug=False)
	```
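
A `/predict` handler like the one above will raise on malformed input, so a small validation step may be worth adding before inference. The sketch below checks a decoded JSON body; `parse_observation` and its `expected_size` parameter are hypothetical and would need to match the model's real observation length.

```python
def parse_observation(payload, expected_size):
    """Validate a decoded JSON body and return the observation as floats.

    Raises ValueError with a readable message when the payload is malformed.
    """
    if not isinstance(payload, dict) or "observation" not in payload:
        raise ValueError("body must be a JSON object with an 'observation' field")
    observation = payload["observation"]
    if not isinstance(observation, list) or len(observation) != expected_size:
        raise ValueError(f"'observation' must be a list of {expected_size} numbers")
    try:
        return [float(value) for value in observation]
    except (TypeError, ValueError):
        raise ValueError("'observation' must contain only numbers") from None
```

Inside the handler, catching `ValueError` and returning an HTTP 400 response with the message keeps bad requests from surfacing as server errors.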

	## Troubleshooting

	### Common Issues

	1. ONNX Model Loading Errors
	   - Ensure ONNX Runtime version compatibility
	   - Check model file path and permissions

	2. Unity Environment Connection
	   - Verify Unity environment executable path
	   - Check port availability (default: 5004)

	3. Observation Shape Mismatches
	   - Ensure observation preprocessing matches training
	   - Check input normalization requirements

	4. Performance Issues
	   - Use deterministic policy for consistent results
	   - Consider batch inference for multiple agents
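
For the connection issue, one quick check is whether the port can be bound at all. The sketch below uses only the standard library; `port_is_free` is a helper introduced here, and it tests the local loopback address as an assumption about where the environment runs.

```python
import socket

DEFAULT_PORT = 5004  # default port mentioned in the list above


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


print(f"Port {DEFAULT_PORT} free: {port_is_free(DEFAULT_PORT)}")
```

If the port is reported as busy, another training run or Unity session may still be holding it.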

	### Performance Optimization

	```python
	import numpy as np

	# Batch processing for multiple agents
	def batch_predict(model_session, observations):
	    """Process multiple observations at once"""
	    # Stack observations into one (n_agents, obs_size) float32 array
	    batch_obs = np.array(observations, dtype=np.float32)
	    # Assumes input "obs_0" and a first output of per-action scores
	    action_probs = model_session.run(None, {"obs_0": batch_obs})
	    actions = np.argmax(action_probs[0], axis=1)
	    return actions
	```

	This guide provides comprehensive instructions for deploying and using your trained Huggy PPO agent in various scenarios, from simple testing to production deployment.