---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Pixelcopter-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 13.10 +/- 6.89
      name: mean_reward
      verified: false
---
# REINFORCE Agent for Pixelcopter-PLE-v0

## Model Description

This repository contains a trained REINFORCE (Monte Carlo policy gradient) agent that plays Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent learns flight control strategies through trial and error, directly optimizing a parameterized policy.
### Model Details

- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- **Framework**: Custom implementation following Deep RL Course guidelines
- **Task Type**: Discrete Control (Binary Actions)
- **Action Space**: Discrete (2 actions: do nothing or thrust up)
- **Observation Space**: Visual/pixel-based or feature-based state representation
### Environment Overview

Pixelcopter-PLE-v0 is a classic helicopter control game where:
- **Objective**: Navigate a helicopter through obstacles without crashing
- **Challenge**: Requires precise timing and control to avoid the ceiling, floor, and obstacles
- **Physics**: Gravity constantly pulls the helicopter down; the player must apply thrust to maintain altitude
- **Scoring**: Points are awarded for surviving longer and successfully navigating through gaps
- **Difficulty**: Requires learning temporal dependencies and precise action timing
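The game can be driven either from raw pixels or from PLE's hand-crafted feature dictionary. Here is a minimal sketch of that interface, assuming PLE is installed as described under Usage below (the exact feature keys may differ across PLE versions):

```python
# Minimal peek at PLE's interface for Pixelcopter (illustrative, headless setup).
# getActionSet() returns the raw values that act() expects, and getGameState()
# returns a dict of features as an alternative to pixel observations.
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

game = Pixelcopter()
env = PLE(game, fps=30, display_screen=False)
env.init()

print(env.getActionSet())        # e.g. [119, None]: thrust key and "do nothing"
print(env.getGameState())        # feature dict: player position, velocity, gap distances
print(env.getScreenRGB().shape)  # raw pixel observation
```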
## Performance

The trained REINFORCE agent achieves the following performance metrics:

- **Mean Reward**: 13.10 ± 6.89
- **Performance Analysis**: Well above a random policy on this challenging environment, though with clear headroom for further training
- **Consistency**: The standard deviation indicates moderate variability, which is expected for policy gradient methods
### Performance Context

The mean reward of 13.10 indicates that the agent has learned to:
- Navigate through multiple obstacles before crashing
- Balance altitude control against obstacle avoidance
- Time its thrust applications
- Survive consistently longer than a random baseline

The variability (±6.89) is characteristic of policy gradient methods and reflects the stochastic nature of the learned policy, which can lead to different episode outcomes based on exploration.
## Algorithm: REINFORCE

REINFORCE is a foundational policy gradient algorithm that:
- **Direct Policy Learning**: Learns a parameterized policy directly (no value function)
- **Monte Carlo Updates**: Uses complete episode returns for policy updates
- **Policy Gradient**: Updates policy parameters in the direction of higher expected return
- **Stochastic Policy**: Learns probabilistic action selection, which provides built-in exploration
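Concretely, REINFORCE performs stochastic gradient ascent on the expected return \\(J(\theta)\\), using the Monte Carlo estimator

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k .
$$

This is exactly the quantity that `update_policy()` in the training sketch below estimates from a single episode, negating it as a loss and normalizing the returns to reduce variance.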
### Key Advantages
- Simple and intuitive policy gradient approach
- Works well with discrete and continuous action spaces
- No need for value function approximation
- Good educational foundation for understanding policy gradients
## Usage

### Installation Requirements

```bash
# Core dependencies
pip install torch torchvision
pip install gymnasium  # optional here; the examples below drive PLE directly
pip install numpy matplotlib

# PLE is not published on PyPI; install pygame and PLE from source
pip install pygame
pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git

# For visualization and analysis
pip install pillow
pip install imageio  # for GIF creation
```
### Loading and Using the Model

```python
import torch
import numpy as np
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Load the trained model (saved below as a full nn.Module).
# Note: adjust the path to your file layout; on PyTorch >= 2.6 pass
# weights_only=False so torch.load can unpickle a full module.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("pixelcopter_reinforce_model.pth", map_location=device)
model.eval()

# Create the environment
def create_pixelcopter_env():
    game = Pixelcopter()
    env = PLE(game, fps=30, display_screen=True)  # display_screen=False for headless
    return env

# Initialize environment
env = create_pixelcopter_env()
env.init()

# Preprocessing (must match the preprocessing used during training)
def preprocess_state(state):
    """Preprocess the game state for the neural network."""
    if isinstance(state, np.ndarray) and len(state.shape) == 3:
        # Image input: channel-first, pixels normalized to [0, 1]
        state = np.transpose(state, (2, 0, 1)) / 255.0
        return state.flatten()  # or keep as an image, depending on the model
    else:
        # Feature input: env.getGameState() returns a dict of floats
        return np.array(list(state.values()), dtype=np.float32)

# Run the trained agent
def run_agent(model, env, episodes=5):
    # PLE's act() expects values from getActionSet() (e.g. [thrust key, None]),
    # not the raw indices 0/1 that the policy outputs.
    action_set = env.getActionSet()
    total_rewards = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            # Get and preprocess the current state
            state = env.getScreenRGB()  # or env.getGameState() if using features
            state = preprocess_state(state)
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            # Sample an action from the policy
            with torch.no_grad():
                action_probs = model(state_tensor)
                action = torch.multinomial(action_probs, 1).item()

            # Execute the action, mapped through the action set
            reward = env.act(action_set[action])
            episode_reward += reward

        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(total_rewards)
    std_reward = np.std(total_rewards)
    print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")

    return total_rewards

# Run the agent
rewards = run_agent(model, env, episodes=10)
```
### Training Your Own Agent

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.softmax(self.fc3(x))

class REINFORCEAgent:
    def __init__(self, state_size, action_size, lr=0.001):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)

        self.saved_log_probs = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs = self.policy_net(state)
        # Sample from the categorical policy and record the log-probability
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()

    def update_policy(self, gamma=0.99):
        # Compute discounted returns, iterating backwards over the episode
        discounted_rewards = []
        R = 0
        for r in reversed(self.rewards):
            R = r + gamma * R
            discounted_rewards.insert(0, R)

        # Normalize returns to reduce gradient variance
        discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)

        # Policy gradient loss: -log pi(a_t | s_t) * G_t summed over the episode
        policy_loss = []
        for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
            policy_loss.append(-log_prob * reward)

        # Update policy
        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Clear episode data
        self.saved_log_probs.clear()
        self.rewards.clear()

        return policy_loss.item()

def train_agent(episodes=2000):
    env = create_pixelcopter_env()
    env.init()
    action_set = env.getActionSet()  # map policy indices to PLE actions

    # Determine state size from your preprocessing
    state_size = len(preprocess_state(env.getScreenRGB()))  # adjust as needed
    action_size = 2  # two actions, ordered as in action_set

    agent = REINFORCEAgent(state_size, action_size)
    scores = deque(maxlen=100)

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            action = agent.select_action(state)

            reward = env.act(action_set[action])
            agent.rewards.append(reward)
            episode_reward += reward

        # Monte Carlo update after each complete episode
        loss = agent.update_policy()
        scores.append(episode_reward)

        if episode % 100 == 0:
            avg_score = np.mean(scores)
            print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")

    # Save the trained model (full module, matching torch.load above)
    torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
    return agent

# Train a new agent
# trained_agent = train_agent()
```
### Evaluation and Analysis

```python
import matplotlib.pyplot as plt

def evaluate_agent_detailed(model, env, episodes=50):
    """Detailed evaluation with statistics and visualization."""
    action_set = env.getActionSet()  # map policy indices to PLE actions
    rewards = []
    episode_lengths = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        steps = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            with torch.no_grad():
                action_probs = model(state_tensor)
                action = torch.multinomial(action_probs, 1).item()

            reward = env.act(action_set[action])
            episode_reward += reward
            steps += 1

        rewards.append(episode_reward)
        episode_lengths.append(steps)

        if (episode + 1) % 10 == 0:
            print(f"Episodes {episode + 1}/{episodes} completed")

    # Statistical analysis
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    median_reward = np.median(rewards)
    max_reward = np.max(rewards)
    min_reward = np.min(rewards)
    mean_length = np.mean(episode_lengths)

    print("\n--- Evaluation Results ---")
    print(f"Episodes: {episodes}")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Median Reward: {median_reward:.2f}")
    print(f"Max Reward: {max_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print(f"Mean Episode Length: {mean_length:.1f} steps")

    # Visualization: per-episode rewards and their distribution
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.plot(rewards)
    plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=20, alpha=0.7)
    plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()

    plt.tight_layout()
    plt.show()

    return {
        'rewards': rewards,
        'episode_lengths': episode_lengths,
        'stats': {
            'mean': mean_reward,
            'std': std_reward,
            'median': median_reward,
            'max': max_reward,
            'min': min_reward
        }
    }

# Run detailed evaluation
# results = evaluate_agent_detailed(model, env, episodes=100)
```
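Since `imageio` is listed above for GIF creation, here is a hedged sketch of recording a replay. It assumes the `model`, `env`, `device`, and `preprocess_state` objects from the loading example, and that PLE surfaces the screen in (width, height, channel) order, so frames are transposed for imageio:

```python
# Record one episode as a GIF (illustrative sketch; assumes the objects above).
import imageio
import numpy as np

action_set = env.getActionSet()
frames = []
env.reset_game()
while not env.game_over():
    state = preprocess_state(env.getScreenRGB())
    with torch.no_grad():
        probs = model(torch.FloatTensor(state).unsqueeze(0).to(device))
    action = torch.multinomial(probs, 1).item()
    env.act(action_set[action])
    # Transpose PLE's (width, height, channel) frame to (H, W, C) for imageio
    frames.append(np.transpose(env.getScreenRGB(), (1, 0, 2)))

imageio.mimsave("replay.gif", frames, fps=30)  # imageio v2; newer versions use duration
```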
## Training Information

### Hyperparameters

The hyperparameters that matter most for REINFORCE on this task (an illustrative configuration follows below):
- **Learning Rate**: sets the size of policy gradient updates; too large destabilizes training
- **Discount Factor (γ)**: balances immediate vs. future rewards
- **Network Architecture**: a small multi-layer perceptron with a softmax output over the two actions
- **Training Length**: enough episodes for the agent to learn temporal patterns
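The exact configuration used for the uploaded checkpoint is not recorded in this card; the values below simply restate the defaults from the training sketch above:

```python
# Defaults from the training sketch above — illustrative, not the verified
# configuration of the uploaded checkpoint.
hparams = {
    "learning_rate": 1e-3,  # Adam step size (lr=0.001)
    "gamma": 0.99,          # discount factor in update_policy()
    "hidden_size": 64,      # width of both hidden layers
    "episodes": 2000,       # default in train_agent()
}
```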
### Training Environment

- **State Representation**: Processed game screen or extracted features
- **Action Space**: Binary discrete actions (do nothing vs. thrust)
- **Reward Signal**: Game score progression with survival bonus
- **Training Episodes**: Extended training to achieve stable performance

### Algorithm Characteristics

- **Sample Efficiency**: Moderate (typical for policy gradient methods)
- **Stability**: Good convergence with proper hyperparameter tuning
- **Exploration**: Built-in through the stochastic policy
- **Interpretability**: Clear policy learning through gradient ascent
## Limitations and Considerations

- **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- **Variance**: Policy gradient estimates can have high variance
- **Environment Specific**: Trained specifically for Pixelcopter game mechanics; it does not transfer to other games
- **Stochastic Performance**: Episode rewards vary because the policy samples its actions
- **Real-time Performance**: The small network keeps inference fast enough for real-time play
## Related Work and Extensions

This model serves as an educational example for:
- **Policy Gradient Methods**: Understanding direct policy optimization
- **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- **Game AI**: Learning complex temporal control tasks
- **Baseline Comparisons**: Foundation for more advanced algorithms (A2C, PPO, etc.)
## Citation

If you use this model in your research or educational projects, please cite:

```bibtex
@misc{pixelcopter_reinforce_2024,
  title={REINFORCE Agent for Pixelcopter-PLE-v0},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
  note={Trained following Deep RL Course Unit 4}
}
```
## Educational Resources

This model was developed following the **Deep Reinforcement Learning Course Unit 4**:
- **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- **Topic**: Policy Gradient Methods and REINFORCE
- **Learning Objectives**: Understanding policy-based RL algorithms

For comprehensive learning about REINFORCE and policy gradient methods, refer to the complete course materials.

## License

This model is distributed under the MIT License. The model is intended for educational and research purposes.