---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Pixelcopter-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 13.10 +/- 6.89
      name: mean_reward
      verified: false
---
# REINFORCE Agent for Pixelcopter-PLE-v0

## Model Description

This repository contains a trained REINFORCE (policy gradient) agent that has learned to play Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent learns flight control through trial and error, updating a parameterized policy directly from episode returns.

### Model Details

- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- **Framework**: Custom implementation following the Deep RL Course guidelines
- **Task Type**: Discrete control (binary actions)
- **Action Space**: Discrete (2 actions: do nothing or thrust up)
- **Observation Space**: Visual/pixel-based or feature-based state representation

### Environment Overview

Pixelcopter-PLE-v0 is a classic helicopter control game:

- **Objective**: Navigate a helicopter through obstacles without crashing
- **Challenge**: Requires precise timing and control to avoid the ceiling, floor, and obstacle blocks
- **Physics**: Gravity constantly pulls the helicopter down; the player must apply thrust to maintain altitude
- **Scoring**: Points are awarded for surviving longer and navigating through gaps
- **Difficulty**: Requires learning temporal dependencies and precise action timing
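
Besides raw pixels, PLE exposes a compact feature-based state through `getGameState()`, which the usage code below refers to. A quick smoke test for inspecting both interfaces (a sketch, assuming PLE is installed as described under Usage below):

```python
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Spin up a headless instance just to inspect the interfaces.
game = Pixelcopter()
env = PLE(game, fps=30, display_screen=False)
env.init()

print("Action set:", env.getActionSet())      # key codes; includes None (no-op) by default
print("State features:", env.getGameState())  # dict of scalar features
print("Screen shape:", env.getScreenRGB().shape)
```

The exact feature names and action-set ordering depend on the PLE version, so inspect them before hard-coding anything.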

## Performance

The trained REINFORCE agent achieves the following metrics:

- **Mean Reward**: 13.10 ± 6.89
- **Performance Analysis**: A solid result for this challenging environment
- **Consistency**: The standard deviation indicates moderate episode-to-episode variability, which is expected for policy gradient methods

### Performance Context

A mean reward of 13.10 shows that the agent has learned to:

- Navigate through multiple obstacles before crashing
- Balance altitude control against obstacle avoidance
- Time its thrust inputs
- Survive well beyond random-baseline performance

The variability (±6.89) is characteristic of policy gradient methods: because actions are sampled from the learned distribution, episode outcomes differ even under the same policy.

## Algorithm: REINFORCE

REINFORCE is a foundational policy gradient algorithm:

- **Direct Policy Learning**: Learns a parameterized policy directly (no value function)
- **Monte Carlo Updates**: Uses complete episode returns for policy updates
- **Policy Gradient**: Updates policy parameters in the direction of higher expected return
- **Stochastic Policy**: Learns probabilistic action selection, which provides built-in exploration
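
In symbols, REINFORCE performs gradient ascent on the expected return $J(\theta)$. The textbook estimator, which the training sketch later in this card implements with normalized returns, is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k
$$

Each update raises the log-probability of an action in proportion to the discounted return that followed it.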

### Key Advantages

- Simple and intuitive policy gradient approach
- Works with both discrete and continuous action spaces
- No need for value function approximation
- A good educational foundation for understanding policy gradients

## Usage

### Installation Requirements

```bash
# Core dependencies
pip install torch torchvision
pip install gymnasium
pip install numpy matplotlib

# The PyGame Learning Environment is not published on PyPI;
# installing from the upstream GitHub repository is the usual route
pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git

# For visualization and analysis
pip install pillow
pip install imageio  # for gif creation
```
### Loading and Using the Model

```python
import numpy as np
import torch
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Load the trained model.
# Note: adjust the path to match your model file structure.
# weights_only=False is needed on recent PyTorch versions because the
# checkpoint stores a full module rather than a state_dict.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("pixelcopter_reinforce_model.pth",
                   map_location=device, weights_only=False)
model.eval()

# Create the environment
def create_pixelcopter_env():
    game = Pixelcopter()
    env = PLE(game, fps=30, display_screen=True)  # display_screen=False for headless
    return env

# Initialize environment
env = create_pixelcopter_env()
env.init()

# PLE's act() expects an entry from getActionSet() (a key code, or None for
# the no-op), not a bare index, so keep the action set around for lookups.
action_set = env.getActionSet()

# Run trained agent
def run_agent(model, env, episodes=5):
    total_rewards = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            # Get current state
            state = env.getScreenRGB()  # or env.getGameState() if using features
            state = preprocess_state(state)  # apply your preprocessing

            # Convert to tensor
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            # Get action probabilities and sample an action index
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()

            # Execute the action, mapping the sampled index to a PLE action
            reward = env.act(action_set[action])
            episode_reward += reward

        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(total_rewards)
    std_reward = np.std(total_rewards)
    print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")

    return total_rewards

# Preprocessing function (adjust to your model's input requirements)
def preprocess_state(state):
    """
    Preprocess the game state for the neural network.
    This must match the preprocessing used during training.
    """
    if isinstance(state, np.ndarray) and len(state.shape) == 3:
        # Image input: channels first, pixels normalized to [0, 1]
        state = np.transpose(state, (2, 0, 1))
        state = state / 255.0
        return state.flatten()  # or keep as an image, depending on the model
    else:
        # Feature input: getGameState() returns a dict of scalar values
        return np.array(list(state.values()))

# Run the agent
rewards = run_agent(model, env, episodes=10)
```
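
Since `imageio` is listed above for GIF creation, here is a minimal sketch for recording an evaluation episode. It assumes the `env`, `model`, `device`, `action_set`, and `preprocess_state` objects defined in the snippet above; exact `fps` handling may vary across imageio versions:

```python
import imageio
import torch

# Record one episode by grabbing RGB frames from PLE each step.
frames = []
env.reset_game()
while not env.game_over():
    frames.append(env.getScreenRGB())  # capture the frame before acting
    state = preprocess_state(env.getScreenRGB())
    state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
    with torch.no_grad():
        action = torch.multinomial(model(state_tensor), 1).item()
    env.act(action_set[action])

imageio.mimsave("pixelcopter_episode.gif", frames, fps=30)
print(f"Saved {len(frames)} frames")
```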

### Training Your Own Agent

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
from collections import deque

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return self.softmax(x)

class REINFORCEAgent:
    def __init__(self, state_size, action_size, lr=0.001):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)

        self.saved_log_probs = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs = self.policy_net(state)
        dist = Categorical(probs)
        action = dist.sample()

        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()

    def update_policy(self, gamma=0.99):
        # Compute discounted returns, working backwards from the last reward
        discounted_rewards = []
        R = 0
        for r in reversed(self.rewards):
            R = r + gamma * R
            discounted_rewards.insert(0, R)

        # Normalize returns to reduce gradient variance
        discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)

        # REINFORCE loss: -log pi(a_t | s_t) * G_t, summed over the episode
        policy_loss = []
        for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
            policy_loss.append(-log_prob * reward)

        # Update policy
        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Clear episode data
        self.saved_log_probs.clear()
        self.rewards.clear()

        return policy_loss.item()

def train_agent(episodes=2000):
    # create_pixelcopter_env and preprocess_state are defined in the
    # loading snippet above
    env = create_pixelcopter_env()
    env.init()
    action_set = env.getActionSet()  # act() expects these values, not indices

    # Determine state size from your preprocessing
    state_size = len(preprocess_state(env.getScreenRGB()))  # adjust as needed
    action_size = 2  # do nothing, thrust

    agent = REINFORCEAgent(state_size, action_size)
    scores = deque(maxlen=100)

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            action = agent.select_action(state)

            reward = env.act(action_set[action])
            agent.rewards.append(reward)
            episode_reward += reward

        # Update policy after each episode
        loss = agent.update_policy()
        scores.append(episode_reward)

        if episode % 100 == 0:
            avg_score = np.mean(scores)
            print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")

    # Save the trained model (stores the full module)
    torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
    return agent

# Train a new agent
# trained_agent = train_agent()
```
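
Saving the full module (as above) keeps the loading snippet simple, but it ties the checkpoint to this exact class definition and, on recent PyTorch versions, requires `weights_only=False` at load time. A more portable alternative is to save only the weights; the filename below is illustrative:

```python
# Save just the parameters instead of the whole module
torch.save(agent.policy_net.state_dict(), "pixelcopter_reinforce_weights.pth")

# To reload: rebuild the architecture first, then restore the weights
model = PolicyNetwork(state_size, action_size)
model.load_state_dict(torch.load("pixelcopter_reinforce_weights.pth", map_location="cpu"))
model.eval()
```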

### Evaluation and Analysis

```python
import matplotlib.pyplot as plt

# Reuses model, env, device, action_set, and preprocess_state from the
# loading snippet above.
def evaluate_agent_detailed(model, env, episodes=50):
    """Detailed evaluation with statistics and visualization."""
    rewards = []
    episode_lengths = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        steps = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()

            reward = env.act(action_set[action])  # map index to PLE action
            episode_reward += reward
            steps += 1

        rewards.append(episode_reward)
        episode_lengths.append(steps)

        if (episode + 1) % 10 == 0:
            print(f"Episodes {episode + 1}/{episodes} completed")

    # Statistical analysis
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    median_reward = np.median(rewards)
    max_reward = np.max(rewards)
    min_reward = np.min(rewards)
    mean_length = np.mean(episode_lengths)

    print("\n--- Evaluation Results ---")
    print(f"Episodes: {episodes}")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Median Reward: {median_reward:.2f}")
    print(f"Max Reward: {max_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print(f"Mean Episode Length: {mean_length:.1f} steps")

    # Visualization
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.plot(rewards)
    plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=20, alpha=0.7)
    plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()

    plt.tight_layout()
    plt.show()

    return {
        'rewards': rewards,
        'episode_lengths': episode_lengths,
        'stats': {
            'mean': mean_reward,
            'std': std_reward,
            'median': median_reward,
            'max': max_reward,
            'min': min_reward
        }
    }

# Run detailed evaluation
# results = evaluate_agent_detailed(model, env, episodes=100)
```
## Training Information

### Hyperparameters

The training sketch above uses the following defaults:

- **Learning Rate**: 0.001 with the Adam optimizer, chosen for stable policy gradient updates
- **Discount Factor (γ)**: 0.99, balancing immediate vs. future rewards
- **Network Architecture**: A multi-layer perceptron with two hidden layers of 64 units
- **Training Length**: 2000 episodes by default, enough to learn temporal patterns

### Training Environment

- **State Representation**: Processed game screen or extracted features
- **Action Space**: Binary discrete actions (do nothing vs. thrust)
- **Reward Signal**: Game score progression with a survival bonus
- **Training Episodes**: Extended training to achieve stable performance

### Algorithm Characteristics

- **Sample Efficiency**: Moderate (typical for policy gradient methods)
- **Stability**: Good convergence with proper hyperparameter tuning
- **Exploration**: Built in through the stochastic policy
- **Interpretability**: Clear policy learning through gradient ascent

## Limitations and Considerations

- **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- **Variance**: Policy gradient estimates can have high variance (see the note below)
- **Environment Specific**: Trained specifically for Pixelcopter game mechanics
- **Stochastic Performance**: Episode rewards vary due to policy stochasticity
- **Real-time Performance**: Inference is fast enough for real-time play
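
The return normalization in `update_policy` is one way to tame this variance; the standard textbook refinement is to subtract a baseline $b(s_t)$ from the return, which reduces variance without biasing the gradient:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big) \right]
$$

Learning $b$ with a value network leads to the actor-critic family (A2C and beyond) mentioned under extensions below.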

## Related Work and Extensions

This model serves as an educational example for:

- **Policy Gradient Methods**: Understanding direct policy optimization
- **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- **Game AI**: Learning temporally extended control tasks
- **Baseline Comparisons**: A foundation for more advanced algorithms (A2C, PPO, etc.)

## Citation

If you use this model in your research or educational projects, please cite:

```bibtex
@misc{pixelcopter_reinforce_2024,
  title={REINFORCE Agent for Pixelcopter-PLE-v0},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
  note={Trained following Deep RL Course Unit 4}
}
```

## Educational Resources

This model was developed following **Unit 4 of the Deep Reinforcement Learning Course**:

- **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- **Topic**: Policy gradient methods and REINFORCE
- **Learning Objectives**: Understanding policy-based RL algorithms

For a comprehensive treatment of REINFORCE and policy gradient methods, refer to the complete course materials.

## License

This model is distributed under the MIT License and is intended for educational and research purposes.