Adilbai committed
Commit 74f9393 · verified · 1 Parent(s): ecac250

Update README.md

Files changed (1)
  1. README.md +426 -18
README.md CHANGED
@@ -10,26 +10,434 @@ tags:
  # **ppo** Agent playing **Huggy**
  This is a trained model of a **ppo** agent playing **Huggy**
  using the [Unity ML-Agents Library](https://github.com/Unity-Technologies/ml-agents).

- ## Usage (with ML-Agents)
- The Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/

- We wrote a complete tutorial to learn to train your first agent using ML-Agents and publish it to the Hub:
- - A *short tutorial* where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your
- browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
- - A *longer tutorial* to understand how ML-Agents works:
- https://huggingface.co/learn/deep-rl-course/unit5/introduction

- ### Resume the training
- ```bash
- mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
- ```

- ### Watch your Agent play
- You can watch your agent **playing directly in your browser**

- 1. If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
- 2. Step 1: Find your model_id: Adilbai/ppo-Huggy-Rl-agent
- 3. Step 2: Select your *.nn /*.onnx file
- 4. Click on Watch the agent play 👀
-
+ # Huggy PPO Agent - Training Documentation

+ ## Model Overview

+ **Huggy** is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit in the custom "Huggy" Unity environment, in which Huggy the Dog learns to fetch a stick. The agent was trained for 2 million steps.

+ ## Training Environment

+ - **Environment**: Unity ML-Agents custom environment "Huggy"
+ - **ML-Agents Version**: 1.2.0.dev0
+ - **ML-Agents Envs**: 1.2.0.dev0
+ - **Communicator API**: 1.5.0
+ - **PyTorch Version**: 2.7.1+cu126
+ - **Unity Package Version**: 2.2.1-exp.1

+ ## Training Configuration
+
+ ### PPO Hyperparameters
+ - **Batch Size**: 2,048
+ - **Buffer Size**: 20,480
+ - **Learning Rate**: 0.0003 (linear schedule)
+ - **Beta (entropy regularization)**: 0.005 (linear schedule)
+ - **Epsilon (PPO clip parameter)**: 0.2 (linear schedule)
+ - **Lambda (GAE parameter)**: 0.95
+ - **Number of Epochs**: 3
+ - **Shared Critic**: False
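To make the hyperparameters above concrete, the following sketch works out how many minibatch updates one full buffer produces and how a linearly scheduled value shrinks over the run. It assumes the schedule decays toward zero at the maximum step count of 2,000,000 stated for this run; ML-Agents may apply a minimum floor that is not modelled here, and the function names are illustrative only.

```python
# Values taken from the hyperparameter list of this run.
BATCH_SIZE = 2_048
BUFFER_SIZE = 20_480
NUM_EPOCHS = 3
MAX_STEPS = 2_000_000


def updates_per_buffer(buffer_size, batch_size, num_epochs):
    """Minibatch gradient updates performed each time the buffer fills."""
    return (buffer_size // batch_size) * num_epochs


def linear_schedule(initial, step, max_steps):
    """Assumed linear decay from `initial` at step 0 toward 0 at max_steps."""
    fraction_left = max(0.0, 1.0 - step / max_steps)
    return initial * fraction_left


updates = updates_per_buffer(BUFFER_SIZE, BATCH_SIZE, NUM_EPOCHS)
lr_midway = linear_schedule(0.0003, MAX_STEPS // 2, MAX_STEPS)
print(f"updates per buffer: {updates}")
print(f"learning rate halfway through training: {lr_midway:.6f}")
```

With these numbers the buffer holds 10 batches, so each buffer fill triggers 30 minibatch updates across the 3 epochs.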
+
+ ### Network Architecture
+ - **Normalization**: Enabled
+ - **Hidden Units**: 512
+ - **Number of Layers**: 3
+ - **Visual Encoding Type**: Simple
+ - **Memory**: None
+ - **Goal Conditioning Type**: Hyper
+ - **Deterministic**: False
+
+ ### Reward Configuration
+ - **Reward Type**: Extrinsic
+ - **Gamma (discount factor)**: 0.995
+ - **Reward Strength**: 1.0
+ - **Reward Network Hidden Units**: 128
+ - **Reward Network Layers**: 2
+
+ ### Training Parameters
+ - **Maximum Steps**: 2,000,000
+ - **Time Horizon**: 1,000
+ - **Summary Frequency**: 50,000 steps
+ - **Checkpoint Interval**: 200,000 steps
+ - **Keep Checkpoints**: 15
+ - **Threaded Training**: False
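As a quick sanity check on the training parameters above, this small sketch derives how many summaries and interval checkpoints the settings imply. It only does arithmetic on the listed values; the variable names are illustrative and are not ML-Agents configuration keys.

```python
# Values taken from the training parameter list of this run.
MAX_STEPS = 2_000_000
SUMMARY_FREQUENCY = 50_000
CHECKPOINT_INTERVAL = 200_000
KEEP_CHECKPOINTS = 15

# Number of reward summaries and interval checkpoints over the whole run.
num_summaries = MAX_STEPS // SUMMARY_FREQUENCY
num_checkpoints = MAX_STEPS // CHECKPOINT_INTERVAL

# If fewer checkpoints are written than the keep limit, none are discarded.
all_checkpoints_kept = num_checkpoints <= KEEP_CHECKPOINTS

print(f"summaries: {num_summaries}, interval checkpoints: {num_checkpoints}")
print(f"all checkpoints kept: {all_checkpoints_kept}")
```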
+
+ ## Training Performance
+
+ ### Performance Progression
+
+ The agent improved rapidly at first and then plateaued:
+
+ **Early Training (0-200k steps):**
+ - Step 50k: Mean Reward = 1.840 ± 0.925
+ - Step 100k: Mean Reward = 2.747 ± 1.096
+ - Step 150k: Mean Reward = 3.031 ± 1.174
+ - Step 200k: Mean Reward = 3.538 ± 1.370
+
+ **Mid Training (200k-1M steps):**
+ - Performance stabilized around 3.6-3.9 mean reward
+ - Peak performance at 500k steps: 3.873 ± 1.783
+
+ **Late Training (1M-2M steps):**
+ - Consistent performance around 3.5-3.8 mean reward
+ - Final performance at 2M steps: 3.718 ± 2.132
+
+ ### Key Performance Metrics
+
+ - **Training Duration**: 2,350.439 seconds (~39 minutes)
+ - **Final Mean Reward**: 3.718
+ - **Final Standard Deviation**: 2.132
+ - **Peak Mean Reward**: 3.873 (at 500k steps)
+ - **Lowest Standard Deviation**: 0.925 (at 50k steps)
+
+ ## Training Characteristics
+
+ ### Learning Curve Analysis
+ 1. **Rapid Initial Learning**: Mean reward rose from 1.84 at 50k steps to 3.54 at 200k steps
+ 2. **Plateau Phase**: Performance stayed roughly level between 200k and 2M steps
+ 3. **Variance Increase**: Standard deviation grew from 0.925 to 2.132 over training, which may reflect more varied episode outcomes
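The learning curve observations above can be quantified with a short sketch. It computes the relative gain over the early phase and the coefficient of variation (standard deviation divided by mean) at the first and last summaries, using only the figures reported in this document. The helper names are illustrative.

```python
def relative_gain(start, end):
    """Fractional improvement from `start` to `end`."""
    return (end - start) / start


def coefficient_of_variation(mean, std):
    """Spread of the reward relative to its mean."""
    return std / mean


# Reported mean reward at 50k and 200k steps.
early_gain = relative_gain(1.840, 3.538)

# Reported (mean, std) at 50k steps and at 2M steps.
cv_early = coefficient_of_variation(1.840, 0.925)
cv_final = coefficient_of_variation(3.718, 2.132)

print(f"gain from 50k to 200k steps: {early_gain:.1%}")
print(f"coefficient of variation: {cv_early:.3f} -> {cv_final:.3f}")
```

Read this way, the absolute spread more than doubled, while the spread relative to the mean grew only modestly.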
+
+ ### Model Checkpoints
+ ONNX models were exported roughly every 200k steps:
+ - Huggy-199933.onnx
+ - Huggy-399938.onnx
+ - Huggy-599920.onnx
+ - Huggy-799966.onnx
+ - Huggy-999748.onnx
+ - Huggy-1199265.onnx
+ - Huggy-1399932.onnx
+ - Huggy-1599985.onnx
+ - Huggy-1799997.onnx
+ - Huggy-1999614.onnx
+ - **Final Model**: Huggy-2000364.onnx
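When picking a checkpoint programmatically, the step count can be read back from the file name. The sketch below assumes the `Huggy-<step>.onnx` naming pattern seen in the list above; the helper name `checkpoint_step` is hypothetical.

```python
import re

CHECKPOINTS = [
    "Huggy-199933.onnx", "Huggy-399938.onnx", "Huggy-599920.onnx",
    "Huggy-799966.onnx", "Huggy-999748.onnx", "Huggy-1199265.onnx",
    "Huggy-1399932.onnx", "Huggy-1599985.onnx", "Huggy-1799997.onnx",
    "Huggy-1999614.onnx", "Huggy-2000364.onnx",
]


def checkpoint_step(filename):
    """Extract the training step from a 'Huggy-<step>.onnx' file name."""
    match = re.fullmatch(r"Huggy-(\d+)\.onnx", filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {filename}")
    return int(match.group(1))


# Sort numerically; a plain string sort would misorder 999748 and 1199265.
ordered = sorted(CHECKPOINTS, key=checkpoint_step)
steps = [checkpoint_step(name) for name in ordered]
gaps = [later - earlier for earlier, later in zip(steps, steps[1:])]

print("latest checkpoint:", ordered[-1])
print("steps between checkpoints:", gaps)
```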
+
+ ## Technical Implementation
+
+ ### Training Framework
+ - Unity ML-Agents with PPO algorithm
+ - Custom Unity environment integration
+ - ONNX model export for deployment
+ - Real-time training monitoring
+
+ ### Model Architecture Details
+ - Multi-layer perceptron with 3 hidden layers
+ - 512 hidden units per layer
+ - Input normalization enabled
+ - Separate actor and critic networks (shared_critic = False)
+ - Hypernetwork goal conditioning
+
+ ### Reward Signal Processing
+ - Single extrinsic reward signal
+ - Discount factor of 0.995 for long-term planning
+ - Reward signal network settings of 2 layers with 128 units
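To give a feel for what a discount factor of 0.995 means in practice, the sketch below computes three standard quantities: the effective horizon 1 / (1 - gamma), the number of steps after which a reward's weight has halved, and the weight left after 1,000 steps (the time horizon used in this run). These are generic properties of exponential discounting, not values logged by the training run.

```python
import math

GAMMA = 0.995
TIME_HORIZON = 1_000  # time horizon from the training parameters

# Sum of the infinite series 1 + gamma + gamma^2 + ...
effective_horizon = 1.0 / (1.0 - GAMMA)

# Number of steps until a future reward is weighted by one half.
half_life = math.log(0.5) / math.log(GAMMA)

# Weight of a reward that arrives TIME_HORIZON steps in the future.
weight_at_time_horizon = GAMMA ** TIME_HORIZON

print(f"effective horizon: {effective_horizon:.0f} steps")
print(f"half-life: {half_life:.1f} steps")
print(f"weight after {TIME_HORIZON} steps: {weight_at_time_horizon:.4f}")
```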
+
+ ## Performance Insights
+
+ ### Strengths
+ - Rapid learning in the first 200k steps
+ - Stable final performance around 3.7 mean reward
+ - Successful completion of 2M training steps
+ - Regular checkpoint generation for model versioning
+
+ ### Observations
+ - Standard deviation increased over training, so episode outcomes became more varied even though the mean stayed level
+ - The plateau after 200k steps suggests that most of the gain came early and that the remaining steps added little mean reward
+ - The agent maintained stable performance without significant degradation
+
+ ### Training Efficiency
+ - **Steps per Second**: ~851 steps/second average
+ - **Episodes per Checkpoint**: Approximately 200-250 episodes per checkpoint
+ - **Memory Usage**: Buffer of 20,480 experiences (10 batches of 2,048) with a time horizon of 1,000
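The throughput figure above follows directly from the reported duration and step count, as this short check shows. It uses the nominal 2,000,000 steps, so the result is an average over the whole run.

```python
TOTAL_STEPS = 2_000_000
DURATION_SECONDS = 2350.439  # reported training duration

steps_per_second = TOTAL_STEPS / DURATION_SECONDS
duration_minutes = DURATION_SECONDS / 60

print(f"average throughput: {steps_per_second:.0f} steps/second")
print(f"duration: {duration_minutes:.1f} minutes")
```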
+
+ This training run shows PPO reaching a stable mean reward of about 3.7 in the Huggy Unity environment.
+ # Huggy PPO Agent - Usage Guide
+
+ ## Prerequisites
+
+ Before using the Huggy model, ensure you have the following installed:
+
+ ```bash
+ # Install Unity ML-Agents
+ # Note: training used 1.2.0.dev0, a development build. If this release pin
+ # is not available from your package index, install ML-Agents from source.
+ pip install mlagents==1.2.0
+
+ # Install required dependencies
+ pip install torch==2.7.1
+ pip install onnx
+ pip install onnxruntime
+ ```
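After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the expected version prefixes mirror the pins in the commands above, and the helper names are illustrative.

```python
from importlib import metadata

# Expected version prefixes; None means any installed version is accepted.
EXPECTED = {"mlagents": "1.2.0", "torch": "2.7.1", "onnx": None, "onnxruntime": None}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected):
    """Build a short status message for each expected package."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif wanted is not None and not found.startswith(wanted):
            report[package] = f"found {found}, expected {wanted}"
        else:
            report[package] = f"ok ({found})"
    return report


for name, status in check_environment(EXPECTED).items():
    print(f"{name}: {status}")
```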
+
+ ## Model Files
+
+ After training, you'll have these key files:
+ - **Huggy.onnx** - The trained model (final version)
+ - **Huggy-2000364.onnx** - Final checkpoint model
+ - **config.yaml** - Training configuration file
+ - **training logs** - Performance metrics and TensorBoard data
+
+ ## Loading and Using the Model
+
+ ### Method 1: Using ML-Agents Python API
+
+ ```python
+ from mlagents_envs.environment import UnityEnvironment
+ from mlagents_envs.base_env import ActionTuple
+ import numpy as np
+
+ # Load the Unity environment
+ env = UnityEnvironment(file_name="path/to/your/huggy_environment")
+
+ # Reset the environment (behavior specs are populated after the first reset)
+ env.reset()
+
+ # Get behavior specs
+ behavior_names = list(env.behavior_specs.keys())
+ behavior_name = behavior_names[0]  # the Huggy behavior
+ spec = env.behavior_specs[behavior_name]
+
+ print(f"Observation space: {spec.observation_specs}")
+ print(f"Action space: {spec.action_spec}")
+
+ # Release the environment process when done
+ env.close()
+ ```
+
+ ### Method 2: Using ONNX Runtime for Inference
+
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+
+ # Load the trained ONNX model
+ model_path = "results/Huggy2/Huggy.onnx"
+ ort_session = ort.InferenceSession(model_path)
+
+ # Get model input/output info. An exported model can expose several outputs,
+ # so print these names and confirm which one holds the action values.
+ input_name = ort_session.get_inputs()[0].name
+ output_name = ort_session.get_outputs()[0].name
+
+ def predict_action(observation):
+     """
+     Predict an action using the trained model.
+
+     Assumes the selected output holds one score per discrete action.
+     """
+     # Prepare observation as a float32 batch of size 1
+     obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
+
+     # Run inference and drop the batch dimension
+     action_scores = ort_session.run([output_name], {input_name: obs_input})[0][0]
+
+     # Take the highest-scoring action (deterministic)
+     action = int(np.argmax(action_scores))
+     # OR, only if the scores are probabilities that sum to 1 (stochastic):
+     # action = np.random.choice(len(action_scores), p=action_scores)
+
+     return action
+ ```
+
+ ### Method 3: Running Trained Agent in Unity
+
+ ```python
+ from mlagents_envs.environment import UnityEnvironment
+ from mlagents_envs.base_env import ActionTuple
+ import onnxruntime as ort
+ import numpy as np
+
+ # Initialize environment and model
+ env = UnityEnvironment(file_name="HuggyEnvironment")
+ ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")
+
+ # Get behavior name (behavior specs are populated after the first reset)
+ env.reset()
+ behavior_names = list(env.behavior_specs.keys())
+ behavior_name = behavior_names[0]
+
+ # Run episodes
+ for episode in range(10):
+     env.reset()
+     decision_steps, terminal_steps = env.get_steps(behavior_name)
+
+     episode_reward = 0
+     step_count = 0
+
+     while len(decision_steps) > 0:
+         # Get observations (one row per agent requesting a decision)
+         observations = decision_steps.obs[0]
+
+         # Predict actions using trained model.
+         # Assumes an input named "obs_0" and a first output holding
+         # per-action scores for a single discrete action branch.
+         actions = []
+         for obs in observations:
+             action_scores = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
+             action = np.argmax(action_scores[0])
+             actions.append(action)
+
+         # Send actions to environment, shaped (num_agents, num_branches)
+         action_tuple = ActionTuple(discrete=np.array(actions, dtype=np.int32).reshape(-1, 1))
+         env.set_actions(behavior_name, action_tuple)
+
+         # Step environment
+         env.step()
+         decision_steps, terminal_steps = env.get_steps(behavior_name)
+
+         # Track rewards of the first reported agent
+         if len(terminal_steps) > 0:
+             episode_reward += terminal_steps.reward[0]
+             break
+         if len(decision_steps) > 0:
+             episode_reward += decision_steps.reward[0]
+
+         step_count += 1
+
+     print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")
+
+ env.close()
+ ```
+
+ ## Evaluation and Testing
+
+ ### Performance Evaluation Script
+
+ ```python
+ import numpy as np
+ from mlagents_envs.base_env import ActionTuple
+
+ def evaluate_model(env, model_session, num_episodes=100):
+     """
+     Evaluate the trained model performance.
+
+     Assumes an input named "obs_0" and a first output holding per-action
+     scores for a single discrete action branch.
+     """
+     results = {
+         'rewards': [],
+         'episode_lengths': [],
+         'success_rate': 0  # placeholder: no success criterion is defined here
+     }
+
+     # Behavior specs are populated after the first reset
+     env.reset()
+     behavior_name = list(env.behavior_specs.keys())[0]
+
+     for episode in range(num_episodes):
+         env.reset()
+         decision_steps, terminal_steps = env.get_steps(behavior_name)
+
+         episode_reward = 0
+         episode_length = 0
+
+         while len(decision_steps) > 0:
+             # Get actions from model
+             observations = decision_steps.obs[0]
+             actions = []
+
+             for obs in observations:
+                 action_scores = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
+                 action = np.argmax(action_scores[0])  # Deterministic policy
+                 actions.append(action)
+
+             # Step environment with actions shaped (num_agents, num_branches)
+             action_tuple = ActionTuple(discrete=np.array(actions, dtype=np.int32).reshape(-1, 1))
+             env.set_actions(behavior_name, action_tuple)
+             env.step()
+
+             decision_steps, terminal_steps = env.get_steps(behavior_name)
+             episode_length += 1
+
+             # Accumulate the first reported agent's reward and check for termination
+             if len(terminal_steps) > 0:
+                 episode_reward += terminal_steps.reward[0]
+                 break
+             if len(decision_steps) > 0:
+                 episode_reward += decision_steps.reward[0]
+
+         results['rewards'].append(episode_reward)
+         results['episode_lengths'].append(episode_length)
+
+     # Calculate statistics
+     mean_reward = np.mean(results['rewards'])
+     std_reward = np.std(results['rewards'])
+     mean_length = np.mean(results['episode_lengths'])
+
+     print(f"Evaluation Results ({num_episodes} episodes):")
+     print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
+     print(f"Mean Episode Length: {mean_length:.1f}")
+     print(f"Min Reward: {np.min(results['rewards']):.3f}")
+     print(f"Max Reward: {np.max(results['rewards']):.3f}")
+
+     return results
+ ```
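The evaluation script above does not define what counts as a success. One possible approach, sketched here with the standard library only, is to treat an episode as successful when its reward reaches a chosen threshold. The threshold and the sample rewards below are made-up illustration values, not results from this model, and `summarize_rewards` is a hypothetical helper.

```python
import statistics


def summarize_rewards(rewards, success_threshold):
    """Summarize episode rewards; success means reward >= success_threshold."""
    successes = sum(1 for reward in rewards if reward >= success_threshold)
    return {
        "mean": statistics.fmean(rewards),
        # Population standard deviation, the same convention as numpy.std.
        "std": statistics.pstdev(rewards),
        "min": min(rewards),
        "max": max(rewards),
        "success_rate": successes / len(rewards),
    }


# Made-up episode rewards, for illustration only.
sample_rewards = [3.0, 4.0, 5.0, 2.0]
summary = summarize_rewards(sample_rewards, success_threshold=3.5)
print(summary)
```

The `rewards` list returned by the evaluation function could be passed in the same way once a threshold that makes sense for the task has been chosen.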
+
+ ## Deployment Options
+
+ ### Option 1: Unity Standalone Build
+ 1. Assign the trained ONNX model to the agent in your Unity environment
+ 2. Build the environment; the agent then uses the ONNX file for inference
+ 3. Deploy as a standalone executable
+
+ ### Option 2: Python Integration
+ ```python
+ # For integration into larger Python applications
+ import numpy as np
+ import onnxruntime as ort
+
+ class HuggyAgent:
+     def __init__(self, model_path):
+         self.session = ort.InferenceSession(model_path)
+         self.input_name = self.session.get_inputs()[0].name
+
+     def act(self, observation):
+         """Get action from observation (assumes the first output holds per-action scores)"""
+         obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
+         action_scores = self.session.run(None, {self.input_name: obs_input})
+         return int(np.argmax(action_scores[0]))
+
+     def act_stochastic(self, observation):
+         """Get stochastic action (assumes the first output is a probability vector)"""
+         obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
+         # Drop the batch dimension so the probabilities are one-dimensional
+         action_probs = self.session.run(None, {self.input_name: obs_input})[0][0]
+         return int(np.random.choice(len(action_probs), p=action_probs))
+
+ # Usage (current_observation is a placeholder for an observation from the environment)
+ agent = HuggyAgent("results/Huggy2/Huggy.onnx")
+ action = agent.act(current_observation)
+ ```
+
+ ### Option 3: Web Deployment
+ ```python
+ # For web applications using Flask
+ from flask import Flask, request, jsonify
+ import onnxruntime as ort
+ import numpy as np
+
+ app = Flask(__name__)
+ model = ort.InferenceSession("Huggy.onnx")
+
+ @app.route('/predict', methods=['POST'])
+ def predict():
+     data = request.json
+     observation = np.array(data['observation'], dtype=np.float32)
+
+     # Assumes an input named "obs_0" and a first output of per-action scores
+     action_scores = model.run(None, {"obs_0": observation.reshape(1, -1)})
+     action = int(np.argmax(action_scores[0]))
+
+     # 'confidence' is the highest score; it is a probability only if the model outputs probabilities
+     return jsonify({'action': action, 'confidence': float(np.max(action_scores[0]))})
+
+ if __name__ == '__main__':
+     # debug=True is for local development only
+     app.run(debug=True)
+ ```
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **ONNX Model Loading Errors**
+    - Ensure ONNX Runtime version compatibility
+    - Check model file path and permissions
+
+ 2. **Unity Environment Connection**
+    - Verify Unity environment executable path
+    - Check port availability (5004 when connecting to the Unity Editor, 5005 by default for built executables)
+
+ 3. **Observation Shape Mismatches**
+    - Ensure observation preprocessing matches training
+    - Check input normalization requirements
+
+ 4. **Performance Issues**
+    - Use deterministic policy for consistent results
+    - Consider batch inference for multiple agents
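For the port check mentioned in the connection item above, a small standard-library sketch can report whether a local port is already taken. It tries to bind a TCP socket on the loopback address; `port_is_free` is a hypothetical helper, and the port number is only the one quoted in this guide.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
        return True


if __name__ == "__main__":
    port = 5004
    state = "free" if port_is_free(port) else "in use or unavailable"
    print(f"port {port} is {state}")
```

A result of "free" only describes the moment of the check; another process can still claim the port before the environment starts.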
+
+ ### Performance Optimization
+
+ ```python
+ import numpy as np
+
+ # Batch processing for multiple agents
+ def batch_predict(model_session, observations):
+     """Process multiple observations at once (assumes input "obs_0" and per-action scores as first output)"""
+     batch_obs = np.array(observations, dtype=np.float32)
+     action_scores = model_session.run(None, {"obs_0": batch_obs})
+     # One action per row of the batch
+     actions = np.argmax(action_scores[0], axis=1)
+     return actions
+ ```
+
+ This guide covers loading, evaluating, and deploying your trained Huggy PPO agent, from simple testing to production deployment.