Huggbottle committed
Commit 0c5a22e · verified · 1 Parent(s): c3785d0

Update README.md

Files changed (1)
1. README.md +6 -416
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - custom-implementation
  - deep-rl-class
  model-index:
- - name: Pixelcopter-RL
+ - name: DeepRL_pixelcopter_policy
  results:
  - task:
  type: reinforcement-learning
@@ -16,422 +16,12 @@ model-index:
  type: Pixelcopter-PLE-v0
  metrics:
  - type: mean_reward
- value: 13.10 +/- 6.89
+ value: 31.50 +/- 29.61
  name: mean_reward
  verified: false
  ---
- # REINFORCE Agent for Pixelcopter-PLE-v0
-
- ## Model Description
-
- This repository contains a trained REINFORCE (Monte Carlo policy gradient) agent that plays Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent learns its flight-control policy through trial and error, directly from episode returns.
-
- ### Model Details
-
- - **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- - **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- - **Framework**: Custom implementation following Deep RL Course guidelines
- - **Task Type**: Discrete control (binary actions)
- - **Action Space**: Discrete (2 actions: do nothing or thrust up)
- - **Observation Space**: Pixel-based or feature-based state representation
-
- ### Environment Overview
-
- Pixelcopter-PLE-v0 is a classic helicopter control game:
- - **Objective**: Navigate a helicopter through obstacles without crashing
- - **Challenge**: Avoiding the ceiling, floor, and obstacles requires precise timing and control
- - **Physics**: Gravity constantly pulls the helicopter down; the agent must apply thrust to maintain altitude
- - **Scoring**: Points are awarded for surviving longer and navigating through gaps
- - **Difficulty**: Success requires learning temporal dependencies and precise action timing
-
- ## Performance
-
- The trained REINFORCE agent achieves the following performance:
-
- - **Mean Reward**: 13.10 ± 6.89
- - **Performance Analysis**: Clearly above a random baseline, which is solid for this difficult environment
- - **Consistency**: The standard deviation indicates moderate episode-to-episode variability, as expected for a stochastic policy
-
- ### Performance Context
-
- A mean reward of 13.10 shows that the agent has learned to:
- - Navigate through multiple obstacles before crashing
- - Balance altitude control against obstacle avoidance
- - Time its thrust inputs
- - Survive consistently longer than a random policy
-
- The variability (±6.89) is characteristic of policy gradient methods: the learned policy is stochastic, so episode outcomes differ depending on which actions are sampled.
-
- ## Algorithm: REINFORCE
-
- REINFORCE is a foundational policy gradient algorithm:
- - **Direct Policy Learning**: Learns a parameterized policy directly, with no value function
- - **Monte Carlo Updates**: Uses complete episode returns for policy updates
- - **Policy Gradient**: Moves policy parameters in the direction of higher expected return (see the update rule sketched below)
- - **Stochastic Policy**: Samples actions probabilistically, which provides built-in exploration
-
- ### Key Advantages
- - Simple and intuitive policy gradient approach
- - Works with both discrete and continuous action spaces
- - No need for value function approximation
- - A good educational foundation for understanding policy gradients
-
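- Concretely, one common statement of the Monte Carlo policy-gradient estimator (notation is mine, not taken from this repo's code) is:
-
- ```latex
- % REINFORCE gradient estimate for one episode of length T,
- % where G_t is the discounted return from step t:
- G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k, \qquad
- \nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
- ```
-
- The `update_policy` method in the training code below implements this estimator, additionally normalizing the returns to reduce gradient variance.
-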
- ## Usage
-
- ### Installation Requirements
-
- ```bash
- # Core dependencies
- pip install torch
- pip install numpy matplotlib
- pip install pygame
-
- # PLE is not published on PyPI; install it from GitHub
- pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git
-
- # For visualization and analysis
- pip install pillow
- pip install imageio  # for gif creation
- ```
-
- ### Loading and Using the Model
-
- ```python
- import torch
- import numpy as np
- from ple import PLE
- from ple.games.pixelcopter import Pixelcopter
-
- # Load the trained model.
- # Note: adjust the path to your model file. Loading a fully pickled module
- # requires the PolicyNetwork class to be importable; on PyTorch >= 2.6 you
- # must also pass weights_only=False for this kind of checkpoint.
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model = torch.load("pixelcopter_reinforce_model.pth", map_location=device, weights_only=False)
- model.eval()
-
- # Create the environment
- def create_pixelcopter_env():
-     game = Pixelcopter()
-     env = PLE(game, fps=30, display_screen=True)  # display_screen=False for headless use
-     return env
-
- # Initialize environment
- env = create_pixelcopter_env()
- env.init()
- ACTION_SET = env.getActionSet()  # PLE's act() expects entries of this set, not bare indices
-
- # Preprocessing function (adjust based on your model's input requirements)
- def preprocess_state(state):
-     """
-     Preprocess the game state for the neural network.
-     This must match the preprocessing used during training.
-     """
-     if isinstance(state, np.ndarray) and len(state.shape) == 3:
-         # Image input: channel-first, pixels normalized to [0, 1]
-         state = np.transpose(state, (2, 0, 1))
-         state = state / 255.0
-         return state.flatten()  # or keep as an image, depending on the model
-     else:
-         # Feature input: env.getGameState() returns a dict of scalars
-         return np.array(list(state.values()))
-
- # Run trained agent
- def run_agent(model, env, episodes=5):
-     total_rewards = []
-
-     for episode in range(episodes):
-         env.reset_game()
-         episode_reward = 0
-
-         while not env.game_over():
-             # Get and preprocess the current state
-             state = env.getScreenRGB()  # or env.getGameState() if using features
-             state = preprocess_state(state)
-             state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
-
-             # Sample an action index from the policy
-             with torch.no_grad():
-                 action_probs = model(state_tensor)
-                 action = torch.multinomial(action_probs, 1).item()
-
-             # Execute the action, mapping the index through the PLE action set
-             # (verify the index order matches the mapping used during training)
-             reward = env.act(ACTION_SET[action])
-             episode_reward += reward
-
-         total_rewards.append(episode_reward)
-         print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
-
-     mean_reward = np.mean(total_rewards)
-     std_reward = np.std(total_rewards)
-     print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")
-
-     return total_rewards
-
- # Run the agent
- rewards = run_agent(model, env, episodes=10)
- ```
-
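- If you prefer the safer `state_dict` format over a fully pickled module, exporting and loading look like this sketch (the weights filename is hypothetical, not a file shipped in this repo; `PolicyNetwork`, `env`, and `preprocess_state` are defined in the surrounding sections):
-
- ```python
- import torch
-
- # Export: save only the parameters rather than pickling the whole module
- # torch.save(agent.policy_net.state_dict(), "pixelcopter_reinforce_weights.pth")
-
- # Load: rebuild the architecture, then restore the parameters.
- # The input size must match the training-time preprocessing.
- state_size = len(preprocess_state(env.getScreenRGB()))
- model = PolicyNetwork(state_size, action_size=2)
- model.load_state_dict(torch.load("pixelcopter_reinforce_weights.pth", map_location="cpu"))
- model.eval()
- ```
-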
- ### Training Your Own Agent
-
- ```python
- import torch
- import torch.nn as nn
- import torch.optim as optim
- import numpy as np
- from collections import deque
- from torch.distributions import Categorical
-
- class PolicyNetwork(nn.Module):
-     def __init__(self, state_size, action_size, hidden_size=64):
-         super().__init__()
-         self.fc1 = nn.Linear(state_size, hidden_size)
-         self.fc2 = nn.Linear(hidden_size, hidden_size)
-         self.fc3 = nn.Linear(hidden_size, action_size)
-         self.softmax = nn.Softmax(dim=1)
-
-     def forward(self, x):
-         x = torch.relu(self.fc1(x))
-         x = torch.relu(self.fc2(x))
-         x = self.fc3(x)
-         return self.softmax(x)
-
- class REINFORCEAgent:
-     def __init__(self, state_size, action_size, lr=0.001):
-         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-         self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
-         self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
-
-         self.saved_log_probs = []
-         self.rewards = []
-
-     def select_action(self, state):
-         state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
-         probs = self.policy_net(state)
-         dist = Categorical(probs)
-         action = dist.sample()
-
-         self.saved_log_probs.append(dist.log_prob(action))
-         return action.item()
-
-     def update_policy(self, gamma=0.99):
-         # Calculate discounted returns, working backwards through the episode
-         discounted_rewards = []
-         R = 0
-         for r in reversed(self.rewards):
-             R = r + gamma * R
-             discounted_rewards.insert(0, R)
-
-         # Normalize returns to reduce gradient variance
-         # (skipped for one-step episodes, where std() is undefined)
-         discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
-         if len(discounted_rewards) > 1:
-             discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)
-
-         # REINFORCE loss: -log pi(a_t|s_t) * G_t, summed over the episode
-         policy_loss = []
-         for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
-             policy_loss.append(-log_prob * reward)
-
-         # Update policy
-         self.optimizer.zero_grad()
-         policy_loss = torch.cat(policy_loss).sum()
-         policy_loss.backward()
-         self.optimizer.step()
-
-         # Clear episode data
-         self.saved_log_probs.clear()
-         self.rewards.clear()
-
-         return policy_loss.item()
-
- def train_agent(episodes=2000):
-     # Reuses create_pixelcopter_env and preprocess_state from the loading section
-     env = create_pixelcopter_env()
-     env.init()
-     action_set = env.getActionSet()
-
-     # Determine the state size from your preprocessing
-     state_size = len(preprocess_state(env.getScreenRGB()))
-     action_size = len(action_set)  # two actions for Pixelcopter
-
-     agent = REINFORCEAgent(state_size, action_size)
-     scores = deque(maxlen=100)
-
-     for episode in range(episodes):
-         env.reset_game()
-         episode_reward = 0
-
-         while not env.game_over():
-             state = preprocess_state(env.getScreenRGB())
-             action = agent.select_action(state)
-
-             reward = env.act(action_set[action])
-             agent.rewards.append(reward)
-             episode_reward += reward
-
-         # Update policy after each complete episode (Monte Carlo)
-         loss = agent.update_policy()
-         scores.append(episode_reward)
-
-         if episode % 100 == 0:
-             avg_score = np.mean(scores)
-             print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")
-
-     # Save the trained model (pickles the full module; see the loading note above)
-     torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
-     return agent
-
- # Train a new agent
- # trained_agent = train_agent()
- ```
-
- ### Evaluation and Analysis
-
- ```python
- import matplotlib.pyplot as plt
- # Reuses torch, np, device, env, ACTION_SET and preprocess_state from the sections above
-
- def evaluate_agent_detailed(model, env, episodes=50):
-     """Detailed evaluation with statistics and visualization."""
-     rewards = []
-     episode_lengths = []
-
-     for episode in range(episodes):
-         env.reset_game()
-         episode_reward = 0
-         steps = 0
-
-         while not env.game_over():
-             state = preprocess_state(env.getScreenRGB())
-             state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
-
-             with torch.no_grad():
-                 action_probs = model(state_tensor)
-                 action = torch.multinomial(action_probs, 1).item()
-
-             reward = env.act(ACTION_SET[action])
-             episode_reward += reward
-             steps += 1
-
-         rewards.append(episode_reward)
-         episode_lengths.append(steps)
-
-         if (episode + 1) % 10 == 0:
-             print(f"Episodes {episode + 1}/{episodes} completed")
-
-     # Statistical analysis
-     mean_reward = np.mean(rewards)
-     std_reward = np.std(rewards)
-     median_reward = np.median(rewards)
-     max_reward = np.max(rewards)
-     min_reward = np.min(rewards)
-     mean_length = np.mean(episode_lengths)
-
-     print("\n--- Evaluation Results ---")
-     print(f"Episodes: {episodes}")
-     print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
-     print(f"Median Reward: {median_reward:.2f}")
-     print(f"Max Reward: {max_reward:.2f}")
-     print(f"Min Reward: {min_reward:.2f}")
-     print(f"Mean Episode Length: {mean_length:.1f} steps")
-
-     # Visualization: per-episode rewards and their distribution
-     plt.figure(figsize=(12, 4))
-
-     plt.subplot(1, 2, 1)
-     plt.plot(rewards)
-     plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
-     plt.title('Episode Rewards')
-     plt.xlabel('Episode')
-     plt.ylabel('Reward')
-     plt.legend()
-
-     plt.subplot(1, 2, 2)
-     plt.hist(rewards, bins=20, alpha=0.7)
-     plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
-     plt.title('Reward Distribution')
-     plt.xlabel('Reward')
-     plt.ylabel('Frequency')
-     plt.legend()
-
-     plt.tight_layout()
-     plt.show()
-
-     return {
-         'rewards': rewards,
-         'episode_lengths': episode_lengths,
-         'stats': {
-             'mean': mean_reward,
-             'std': std_reward,
-             'median': median_reward,
-             'max': max_reward,
-             'min': min_reward
-         }
-     }
-
- # Run detailed evaluation
- # results = evaluate_agent_detailed(model, env, episodes=100)
- ```
-
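- The install list above includes `imageio` for GIF creation but no recording code; here is a minimal sketch (the function name and output path are mine, and it reuses `model`, `env`, `ACTION_SET`, `preprocess_state`, and `device` from the blocks above):
-
- ```python
- import imageio
-
- def record_episode_gif(model, env, path="pixelcopter_episode.gif", fps=30):
-     """Roll out one episode with the trained policy and save the frames as a GIF."""
-     frames = []
-     env.reset_game()
-     while not env.game_over():
-         # getScreenRGB() frames may need a transpose depending on PLE's screen orientation
-         frames.append(env.getScreenRGB())
-         state = preprocess_state(env.getScreenRGB())
-         state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
-         with torch.no_grad():
-             action = torch.multinomial(model(state_tensor), 1).item()
-         env.act(ACTION_SET[action])
-     imageio.mimsave(path, frames, fps=fps)
-
- # record_episode_gif(model, env)
- ```
-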
- ## Training Information
-
- ### Hyperparameters
-
- The exact hyperparameter values used for this checkpoint are not recorded here; the example training script above uses:
- - **Learning Rate**: 1e-3 with the Adam optimizer
- - **Discount Factor (γ)**: 0.99
- - **Network Architecture**: A multi-layer perceptron with two 64-unit hidden layers
- - **Training Episodes**: On the order of 2000, enough to learn the temporal patterns of the game
-
- ### Training Environment
-
- - **State Representation**: Processed game screen or extracted features
- - **Action Space**: Binary discrete actions (do nothing vs. thrust)
- - **Reward Signal**: Game score progression with survival bonus
- - **Training Episodes**: Extended training to achieve stable performance
-
- ### Algorithm Characteristics
-
- - **Sample Efficiency**: Moderate (typical for policy gradient methods)
- - **Stability**: Good convergence with proper hyperparameter tuning
- - **Exploration**: Built-in through the stochastic policy
- - **Interpretability**: Clear policy learning through gradient ascent
-
- ## Limitations and Considerations
-
- - **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- - **Variance**: Monte Carlo policy gradient estimates can have high variance
- - **Environment Specific**: Trained specifically for Pixelcopter's game mechanics; the policy does not transfer to other tasks
- - **Stochastic Performance**: Episode rewards vary because actions are sampled from the policy
- - **Real-time Performance**: Inference is a single small forward pass, fast enough for real-time play
-
- ## Related Work and Extensions
-
- This model serves as an educational example for:
- - **Policy Gradient Methods**: Understanding direct policy optimization
- - **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- - **Game AI**: Learning complex temporal control tasks
- - **Baseline Comparisons**: A foundation for more advanced algorithms (A2C, PPO, etc.)
-
- ## Citation
-
- If you use this model in your research or educational projects, please cite:
-
- ```bibtex
- @misc{pixelcopter_reinforce_2024,
-   title={REINFORCE Agent for Pixelcopter-PLE-v0},
-   author={Adilbai},
-   year={2024},
-   publisher={Hugging Face},
-   howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
-   note={Trained following Deep RL Course Unit 4}
- }
- ```
-
- ## Educational Resources
-
- This model was developed following **Unit 4 of the Deep Reinforcement Learning Course**:
- - **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- - **Topic**: Policy Gradient Methods and REINFORCE
- - **Learning Objectives**: Understanding policy-based RL algorithms
-
- For comprehensive coverage of REINFORCE and policy gradient methods, refer to the complete course materials.
-
- ## License
-
- This model is distributed under the MIT License and is intended for educational and research purposes.
+
+ # **Reinforce** Agent playing **Pixelcopter-PLE-v0**
+ This is a trained model of a **Reinforce** agent playing **Pixelcopter-PLE-v0**.
+ To learn to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction
+