---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Pixelcopter-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 13.10 +/- 6.89
      name: mean_reward
      verified: false
---
# REINFORCE Agent for Pixelcopter-PLE-v0

## Model Description

This repository contains a trained REINFORCE (policy gradient) agent that has learned to play Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent learns flight control through trial and error, updating a parameterized policy directly from episode returns.

### Model Details

- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- **Framework**: Custom implementation following the Deep RL Course guidelines
- **Task Type**: Discrete control (binary actions)
- **Action Space**: Discrete (2 actions: do nothing or thrust up)
- **Observation Space**: Visual/pixel-based or feature-based state representation

### Environment Overview

Pixelcopter-PLE-v0 is a classic helicopter control game:

- **Objective**: Navigate a helicopter through obstacles without crashing
- **Challenge**: Requires precise timing and control to avoid the ceiling, floor, and obstacle blocks
- **Physics**: Gravity constantly pulls the helicopter down; the player must apply thrust to maintain altitude
- **Scoring**: Points are awarded for surviving longer and navigating through gaps
- **Difficulty**: Requires learning temporal dependencies and precise action timing
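
Besides raw pixels, PLE exposes a compact feature-based state through `getGameState()`, which the usage code below refers to. A quick smoke test for inspecting both interfaces (a sketch, assuming PLE is installed as described under Usage below):

```python
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Spin up a headless instance just to inspect the interfaces.
game = Pixelcopter()
env = PLE(game, fps=30, display_screen=False)
env.init()

print("Action set:", env.getActionSet())      # key codes; includes None (no-op) by default
print("State features:", env.getGameState())  # dict of scalar features
print("Screen shape:", env.getScreenRGB().shape)
```

The exact feature names and action-set ordering depend on the PLE version, so inspect them before hard-coding anything.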

## Performance

The trained REINFORCE agent achieves the following metrics:

- **Mean Reward**: 13.10 ± 6.89
- **Performance Analysis**: A solid result for this challenging environment
- **Consistency**: The standard deviation indicates moderate episode-to-episode variability, which is expected for policy gradient methods

### Performance Context

A mean reward of 13.10 shows that the agent has learned to:

- Navigate through multiple obstacles before crashing
- Balance altitude control against obstacle avoidance
- Time its thrust inputs
- Survive well beyond random-baseline performance

The variability (±6.89) is characteristic of policy gradient methods: because actions are sampled from the learned distribution, episode outcomes differ even under the same policy.

## Algorithm: REINFORCE

REINFORCE is a foundational policy gradient algorithm:

- **Direct Policy Learning**: Learns a parameterized policy directly (no value function)
- **Monte Carlo Updates**: Uses complete episode returns for policy updates
- **Policy Gradient**: Updates policy parameters in the direction of higher expected return
- **Stochastic Policy**: Learns probabilistic action selection, which provides built-in exploration
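
In symbols, REINFORCE performs gradient ascent on the expected return $J(\theta)$. The textbook estimator, which the training sketch later in this card implements with normalized returns, is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k
$$

Each update raises the log-probability of an action in proportion to the discounted return that followed it.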

### Key Advantages

- Simple and intuitive policy gradient approach
- Works with both discrete and continuous action spaces
- No need for value function approximation
- A good educational foundation for understanding policy gradients

## Usage

### Installation Requirements

```bash
# Core dependencies
pip install torch torchvision
pip install gymnasium
pip install numpy matplotlib

# The PyGame Learning Environment is not published on PyPI;
# installing from the upstream GitHub repository is the usual route
pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git

# For visualization and analysis
pip install pillow
pip install imageio  # for gif creation
```
### Loading and Using the Model

```python
import numpy as np
import torch
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Load the trained model.
# Note: adjust the path to match your model file structure.
# weights_only=False is needed on recent PyTorch versions because the
# checkpoint stores a full module rather than a state_dict.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("pixelcopter_reinforce_model.pth",
                   map_location=device, weights_only=False)
model.eval()

# Create the environment
def create_pixelcopter_env():
    game = Pixelcopter()
    env = PLE(game, fps=30, display_screen=True)  # display_screen=False for headless
    return env

# Initialize environment
env = create_pixelcopter_env()
env.init()

# PLE's act() expects an entry from getActionSet() (a key code, or None for
# the no-op), not a bare index, so keep the action set around for lookups.
action_set = env.getActionSet()

# Run trained agent
def run_agent(model, env, episodes=5):
    total_rewards = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            # Get current state
            state = env.getScreenRGB()  # or env.getGameState() if using features
            state = preprocess_state(state)  # apply your preprocessing

            # Convert to tensor
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            # Get action probabilities and sample an action index
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()

            # Execute the action, mapping the sampled index to a PLE action
            reward = env.act(action_set[action])
            episode_reward += reward

        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(total_rewards)
    std_reward = np.std(total_rewards)
    print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")

    return total_rewards

# Preprocessing function (adjust to your model's input requirements)
def preprocess_state(state):
    """
    Preprocess the game state for the neural network.
    This must match the preprocessing used during training.
    """
    if isinstance(state, np.ndarray) and len(state.shape) == 3:
        # Image input: channels first, pixels normalized to [0, 1]
        state = np.transpose(state, (2, 0, 1))
        state = state / 255.0
        return state.flatten()  # or keep as an image, depending on the model
    else:
        # Feature input: getGameState() returns a dict of scalar values
        return np.array(list(state.values()))

# Run the agent
rewards = run_agent(model, env, episodes=10)
```
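
Since `imageio` is listed above for GIF creation, here is a minimal sketch for recording an evaluation episode. It assumes the `env`, `model`, `device`, `action_set`, and `preprocess_state` objects defined in the snippet above; exact `fps` handling may vary across imageio versions:

```python
import imageio
import torch

# Record one episode by grabbing RGB frames from PLE each step.
frames = []
env.reset_game()
while not env.game_over():
    frames.append(env.getScreenRGB())  # capture the frame before acting
    state = preprocess_state(env.getScreenRGB())
    state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
    with torch.no_grad():
        action = torch.multinomial(model(state_tensor), 1).item()
    env.act(action_set[action])

imageio.mimsave("pixelcopter_episode.gif", frames, fps=30)
print(f"Saved {len(frames)} frames")
```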

### Training Your Own Agent

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
from collections import deque

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return self.softmax(x)

class REINFORCEAgent:
    def __init__(self, state_size, action_size, lr=0.001):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)

        self.saved_log_probs = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs = self.policy_net(state)
        dist = Categorical(probs)
        action = dist.sample()

        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()

    def update_policy(self, gamma=0.99):
        # Compute discounted returns, working backwards from the last reward
        discounted_rewards = []
        R = 0
        for r in reversed(self.rewards):
            R = r + gamma * R
            discounted_rewards.insert(0, R)

        # Normalize returns to reduce gradient variance
        discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)

        # REINFORCE loss: -log pi(a_t | s_t) * G_t, summed over the episode
        policy_loss = []
        for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
            policy_loss.append(-log_prob * reward)

        # Update policy
        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Clear episode data
        self.saved_log_probs.clear()
        self.rewards.clear()

        return policy_loss.item()

def train_agent(episodes=2000):
    # create_pixelcopter_env and preprocess_state are defined in the
    # loading snippet above
    env = create_pixelcopter_env()
    env.init()
    action_set = env.getActionSet()  # act() expects these values, not indices

    # Determine state size from your preprocessing
    state_size = len(preprocess_state(env.getScreenRGB()))  # adjust as needed
    action_size = 2  # do nothing, thrust

    agent = REINFORCEAgent(state_size, action_size)
    scores = deque(maxlen=100)

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            action = agent.select_action(state)

            reward = env.act(action_set[action])
            agent.rewards.append(reward)
            episode_reward += reward

        # Update policy after each episode
        loss = agent.update_policy()
        scores.append(episode_reward)

        if episode % 100 == 0:
            avg_score = np.mean(scores)
            print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")

    # Save the trained model (stores the full module)
    torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
    return agent

# Train a new agent
# trained_agent = train_agent()
```
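
Saving the full module (as above) keeps the loading snippet simple, but it ties the checkpoint to this exact class definition and, on recent PyTorch versions, requires `weights_only=False` at load time. A more portable alternative is to save only the weights; the filename below is illustrative:

```python
# Save just the parameters instead of the whole module
torch.save(agent.policy_net.state_dict(), "pixelcopter_reinforce_weights.pth")

# To reload: rebuild the architecture first, then restore the weights
model = PolicyNetwork(state_size, action_size)
model.load_state_dict(torch.load("pixelcopter_reinforce_weights.pth", map_location="cpu"))
model.eval()
```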

### Evaluation and Analysis

```python
import matplotlib.pyplot as plt

# Reuses model, env, device, action_set, and preprocess_state from the
# loading snippet above.
def evaluate_agent_detailed(model, env, episodes=50):
    """Detailed evaluation with statistics and visualization."""
    rewards = []
    episode_lengths = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        steps = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()

            reward = env.act(action_set[action])  # map index to PLE action
            episode_reward += reward
            steps += 1

        rewards.append(episode_reward)
        episode_lengths.append(steps)

        if (episode + 1) % 10 == 0:
            print(f"Episodes {episode + 1}/{episodes} completed")

    # Statistical analysis
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    median_reward = np.median(rewards)
    max_reward = np.max(rewards)
    min_reward = np.min(rewards)
    mean_length = np.mean(episode_lengths)

    print("\n--- Evaluation Results ---")
    print(f"Episodes: {episodes}")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Median Reward: {median_reward:.2f}")
    print(f"Max Reward: {max_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print(f"Mean Episode Length: {mean_length:.1f} steps")

    # Visualization
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.plot(rewards)
    plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=20, alpha=0.7)
    plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()

    plt.tight_layout()
    plt.show()

    return {
        'rewards': rewards,
        'episode_lengths': episode_lengths,
        'stats': {
            'mean': mean_reward,
            'std': std_reward,
            'median': median_reward,
            'max': max_reward,
            'min': min_reward
        }
    }

# Run detailed evaluation
# results = evaluate_agent_detailed(model, env, episodes=100)
```
## Training Information

### Hyperparameters

The training sketch above uses the following defaults:

- **Learning Rate**: 0.001 with the Adam optimizer, chosen for stable policy gradient updates
- **Discount Factor (γ)**: 0.99, balancing immediate vs. future rewards
- **Network Architecture**: A multi-layer perceptron with two hidden layers of 64 units
- **Training Length**: 2000 episodes by default, enough to learn temporal patterns

### Training Environment

- **State Representation**: Processed game screen or extracted features
- **Action Space**: Binary discrete actions (do nothing vs. thrust)
- **Reward Signal**: Game score progression with a survival bonus
- **Training Episodes**: Extended training to achieve stable performance

### Algorithm Characteristics

- **Sample Efficiency**: Moderate (typical for policy gradient methods)
- **Stability**: Good convergence with proper hyperparameter tuning
- **Exploration**: Built in through the stochastic policy
- **Interpretability**: Clear policy learning through gradient ascent

## Limitations and Considerations

- **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- **Variance**: Policy gradient estimates can have high variance (see the note below)
- **Environment Specific**: Trained specifically for Pixelcopter game mechanics
- **Stochastic Performance**: Episode rewards vary due to policy stochasticity
- **Real-time Performance**: Inference is fast enough for real-time play
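
The return normalization in `update_policy` is one way to tame this variance; the standard textbook refinement is to subtract a baseline $b(s_t)$ from the return, which reduces variance without biasing the gradient:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big) \right]
$$

Learning $b$ with a value network leads to the actor-critic family (A2C and beyond) mentioned under extensions below.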

## Related Work and Extensions

This model serves as an educational example for:

- **Policy Gradient Methods**: Understanding direct policy optimization
- **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- **Game AI**: Learning temporally extended control tasks
- **Baseline Comparisons**: A foundation for more advanced algorithms (A2C, PPO, etc.)

## Citation

If you use this model in your research or educational projects, please cite:

```bibtex
@misc{pixelcopter_reinforce_2024,
  title={REINFORCE Agent for Pixelcopter-PLE-v0},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
  note={Trained following Deep RL Course Unit 4}
}
```

## Educational Resources

This model was developed following **Unit 4 of the Deep Reinforcement Learning Course**:

- **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- **Topic**: Policy gradient methods and REINFORCE
- **Learning Objectives**: Understanding policy-based RL algorithms

For a comprehensive treatment of REINFORCE and policy gradient methods, refer to the complete course materials.

## License

This model is distributed under the MIT License and is intended for educational and research purposes.