---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Pixelcopter-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 13.10 +/- 6.89
      name: mean_reward
      verified: false
---
# REINFORCE Agent for Pixelcopter-PLE-v0
## Model Description
This repository contains a trained REINFORCE (Policy Gradient) reinforcement learning agent that has learned to play Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent uses policy gradient methods to learn optimal flight control strategies through trial and error.
### Model Details
- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- **Framework**: Custom implementation following Deep RL Course guidelines
- **Task Type**: Discrete Control (Binary Actions)
- **Action Space**: Discrete (2 actions: do nothing or thrust up)
- **Observation Space**: Visual/pixel-based or feature-based state representation
### Environment Overview
Pixelcopter-PLE-v0 is a classic helicopter control game where:
- **Objective**: Navigate a helicopter through obstacles without crashing
- **Challenge**: Requires precise timing and control to avoid ceiling, floor, and obstacles
- **Physics**: Gravity constantly pulls the helicopter down; player must apply thrust to maintain altitude
- **Scoring**: Points are awarded for surviving longer and successfully navigating through gaps
- **Difficulty**: Requires learning temporal dependencies and precise action timing
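This gravity-versus-thrust trade-off can be sketched as a toy model (illustrative constants only, not the game's actual physics):

```python
# Toy vertical dynamics illustrating the gravity/thrust trade-off.
# Constants are made up for illustration; altitude grows downward,
# like screen coordinates.
GRAVITY = 1.0   # downward acceleration per step
THRUST = -2.5   # upward acceleration while thrusting

def step(altitude, velocity, thrust_on):
    """Advance the toy copter by one timestep."""
    velocity += THRUST if thrust_on else GRAVITY
    altitude += velocity
    return altitude, velocity

# With no thrust, the copter accelerates downward every step...
alt, vel = 0.0, 0.0
for _ in range(3):
    alt, vel = step(alt, vel, thrust_on=False)
print(alt, vel)  # falling faster each step

# ...so the agent must fire thrust intermittently, with precise timing,
# to hold altitude between the ceiling, floor, and obstacles.
```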
## Performance
The trained REINFORCE agent achieves the following performance metrics:
- **Mean Reward**: 13.10 ± 6.89
- **Performance Analysis**: This represents solid performance for this challenging environment
- **Consistency**: The standard deviation indicates moderate variability, which is expected for policy gradient methods
### Performance Context
The mean reward of 13.10 demonstrates that the agent has successfully learned to:
- Navigate through multiple obstacles before crashing
- Balance altitude control against obstacle avoidance
- Develop timing strategies for thrust application
- Achieve consistent survival beyond random baseline performance
The variability (±6.89) is characteristic of policy gradient methods and reflects the stochastic nature of the learned policy, which can lead to different episode outcomes based on exploration.
## Algorithm: REINFORCE
REINFORCE is a foundational policy gradient algorithm that:
- **Direct Policy Learning**: Learns a parameterized policy directly (no value function)
- **Monte Carlo Updates**: Uses complete episode returns for policy updates
- **Policy Gradient**: Updates policy parameters in direction of higher expected returns
- **Stochastic Policy**: Learns probabilistic action selection for exploration
### Key Advantages
- Simple and intuitive policy gradient approach
- Works well with discrete and continuous action spaces
- No need for value function approximation
- Good educational foundation for understanding policy gradients
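To make the Monte Carlo update concrete, here is a minimal, dependency-free sketch of the discounted-return computation that drives the REINFORCE loss (the same recurrence appears in the training code below):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for each timestep, backwards."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

# Example: three steps of reward 1 with gamma = 0.9
print(discounted_returns([1, 1, 1], gamma=0.9))
# -> approximately [2.71, 1.9, 1.0]

# The REINFORCE loss is then sum over t of -log pi(a_t | s_t) * G_t,
# minimized by gradient descent on the policy parameters.
```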
## Usage
### Installation Requirements
```bash
# Core dependencies
pip install torch
pip install numpy matplotlib
pip install pygame

# PyGame Learning Environment (PLE) is not on PyPI; install it from GitHub
pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git

# For visualization and analysis
pip install pillow
pip install imageio  # for GIF creation
```
### Loading and Using the Model
```python
import torch
import numpy as np
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Load the trained model
# Note: adjust the path to match your model file. The file stores the full
# module, so torch.load returns the network directly.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("pixelcopter_reinforce_model.pth", map_location=device,
                   weights_only=False)
model.eval()

# Create the environment
def create_pixelcopter_env():
    game = Pixelcopter()
    env = PLE(game, fps=30, display_screen=True)  # display_screen=False for headless
    return env

# Initialize environment
env = create_pixelcopter_env()
env.init()
action_set = env.getActionSet()  # e.g. [thrust key, None]; env.act expects these

# Preprocessing function (adjust based on your model's input requirements)
def preprocess_state(state):
    """Preprocess the game state for the neural network.

    This must match the preprocessing used during training.
    """
    if isinstance(state, np.ndarray) and state.ndim == 3:
        # Image input: channel-first, normalized pixels
        state = np.transpose(state, (2, 0, 1)) / 255.0
        return state.flatten()  # or keep as an image, depending on the model
    # Feature input: PLE's getGameState() returns a dict of floats
    return np.array(list(state.values()), dtype=np.float32)

# Run trained agent
def run_agent(model, env, episodes=5):
    total_rewards = []
    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        while not env.game_over():
            # Get and preprocess the current state
            state = preprocess_state(env.getScreenRGB())  # or env.getGameState() for features
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
            # Sample an action from the policy
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()
            # Execute the action; PLE expects an entry of getActionSet(),
            # not a raw index (index 0: thrust, index 1: do nothing)
            reward = env.act(action_set[action])
            episode_reward += reward
        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
    mean_reward = np.mean(total_rewards)
    std_reward = np.std(total_rewards)
    print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")
    return total_rewards

# Run the agent
rewards = run_agent(model, env, episodes=10)
```
### Training Your Own Agent
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.softmax(self.fc3(x), dim=1)

class REINFORCEAgent:
    def __init__(self, state_size, action_size, lr=0.001):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.saved_log_probs = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs = self.policy_net(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()

    def update_policy(self, gamma=0.99):
        # Compute discounted returns, working backwards through the episode
        discounted_rewards = []
        R = 0
        for r in reversed(self.rewards):
            R = r + gamma * R
            discounted_rewards.insert(0, R)
        # Normalize returns to reduce gradient variance
        discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)
        # Policy gradient loss: -log pi(a|s) weighted by the return
        policy_loss = []
        for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
            policy_loss.append(-log_prob * reward)
        # Update policy
        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()
        # Clear episode data
        self.saved_log_probs.clear()
        self.rewards.clear()
        return policy_loss.item()

def train_agent(episodes=2000):
    env = create_pixelcopter_env()  # defined in the previous snippet
    env.init()
    action_set = env.getActionSet()
    # Determine state size from the preprocessing output
    state_size = len(preprocess_state(env.getScreenRGB()))  # adjust as needed
    action_size = len(action_set)  # 2: thrust, do nothing
    agent = REINFORCEAgent(state_size, action_size)
    scores = deque(maxlen=100)
    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            action = agent.select_action(state)
            reward = env.act(action_set[action])  # map index to a PLE action
            agent.rewards.append(reward)
            episode_reward += reward
        # Update policy after each complete episode (Monte Carlo)
        loss = agent.update_policy()
        scores.append(episode_reward)
        if episode % 100 == 0:
            avg_score = np.mean(scores)
            print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")
    # Save the trained model (stores the full module)
    torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
    return agent

# Train a new agent
# trained_agent = train_agent()
```
### Evaluation and Analysis
```python
import numpy as np
import torch
import matplotlib.pyplot as plt

def evaluate_agent_detailed(model, env, episodes=50):
    """Detailed evaluation with statistics and visualization."""
    device = next(model.parameters()).device
    action_set = env.getActionSet()
    rewards = []
    episode_lengths = []
    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        steps = 0
        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()
            reward = env.act(action_set[action])  # map index to a PLE action
            episode_reward += reward
            steps += 1
        rewards.append(episode_reward)
        episode_lengths.append(steps)
        if (episode + 1) % 10 == 0:
            print(f"Episodes {episode + 1}/{episodes} completed")

    # Statistical analysis
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    median_reward = np.median(rewards)
    max_reward = np.max(rewards)
    min_reward = np.min(rewards)
    mean_length = np.mean(episode_lengths)

    print("\n--- Evaluation Results ---")
    print(f"Episodes: {episodes}")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Median Reward: {median_reward:.2f}")
    print(f"Max Reward: {max_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print(f"Mean Episode Length: {mean_length:.1f} steps")

    # Visualization: per-episode rewards and their distribution
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(rewards)
    plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=20, alpha=0.7)
    plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()

    plt.tight_layout()
    plt.show()

    return {
        'rewards': rewards,
        'episode_lengths': episode_lengths,
        'stats': {
            'mean': mean_reward,
            'std': std_reward,
            'median': median_reward,
            'max': max_reward,
            'min': min_reward,
        },
    }

# Run detailed evaluation
# results = evaluate_agent_detailed(model, env, episodes=100)
```
## Training Information
### Hyperparameters
The agent was trained with hyperparameters along the lines of the training sketch above (the exact settings of the uploaded checkpoint are not recorded in this card):
- **Learning Rate**: 0.001 with the Adam optimizer, chosen for stable policy gradient updates
- **Discount Factor (γ)**: 0.99, balancing immediate vs. future rewards
- **Network Architecture**: Multi-layer perceptron with two 64-unit hidden layers
- **Training Length**: On the order of 2000 episodes, enough to learn temporal patterns
### Training Environment
- **State Representation**: Processed game screen or extracted features
- **Action Space**: Binary discrete actions (do nothing vs. thrust)
- **Reward Signal**: Game score progression with survival bonus
- **Training Episodes**: Extended training to achieve stable performance
### Algorithm Characteristics
- **Sample Efficiency**: Moderate (typical for policy gradient methods)
- **Stability**: Good convergence with proper hyperparameter tuning
- **Exploration**: Built-in through stochastic policy
- **Interpretability**: Clear policy learning through gradient ascent
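The built-in exploration comes from sampling actions from the policy's output probabilities rather than taking the argmax. A stdlib-only illustration (the 0.8/0.2 probabilities are hypothetical, not taken from the trained model):

```python
import random

random.seed(0)

# Hypothetical policy output for one state: P(do nothing), P(thrust)
action_probs = [0.8, 0.2]

# Sampling (rather than argmax) means the thrust action is still tried
# roughly 20% of the time, which is what drives exploration in REINFORCE.
actions = random.choices([0, 1], weights=action_probs, k=10_000)
thrust_rate = actions.count(1) / len(actions)
print(thrust_rate)  # close to 0.2; individual episodes still differ
```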
## Limitations and Considerations
- **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- **Variance**: Policy gradient estimates can have high variance
- **Environment Specific**: Trained specifically for Pixelcopter game mechanics
- **Stochastic Performance**: Episode rewards vary due to policy stochasticity
- **Real-time Performance**: Inference speed suitable for real-time game play
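The variance issue is why the training code normalizes returns before each update; the effect is easy to see numerically (the return values below are made up for illustration):

```python
import statistics

# Raw discounted returns from a few hypothetical episodes
returns = [2.0, 15.0, 7.0, 30.0, 1.0]

mean = statistics.mean(returns)
std = statistics.pstdev(returns)
normalized = [(g - mean) / (std + 1e-8) for g in returns]

# After normalization the returns are centered at zero with unit scale,
# so gradient magnitudes no longer track the raw reward scale.
print(round(statistics.mean(normalized), 6))
print(round(statistics.pstdev(normalized), 6))
```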
## Related Work and Extensions
This model serves as an excellent educational example for:
- **Policy Gradient Methods**: Understanding direct policy optimization
- **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- **Game AI**: Learning complex temporal control tasks
- **Baseline Comparisons**: Foundation for more advanced algorithms (A2C, PPO, etc.)
## Citation
If you use this model in your research or educational projects, please cite:
```bibtex
@misc{pixelcopter_reinforce_2024,
  title        = {REINFORCE Agent for Pixelcopter-PLE-v0},
  author       = {Adilbai},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
  note         = {Trained following Deep RL Course Unit 4}
}
```
## Educational Resources
This model was developed following the **Deep Reinforcement Learning Course Unit 4**:
- **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- **Topic**: Policy Gradient Methods and REINFORCE
- **Learning Objectives**: Understanding policy-based RL algorithms
For comprehensive learning about REINFORCE and policy gradient methods, refer to the complete course materials.
## License
This model is distributed under the MIT License. The model is intended for educational and research purposes.