# REINFORCE Agent Playing Pixelcopter-PLE-v0

This is a trained model of a REINFORCE agent playing Pixelcopter-PLE-v0, built with PyTorch as part of the Deep Reinforcement Learning Course.
## Algorithm
REINFORCE is a policy gradient method that:
- Directly optimizes the policy π(a|s)
- Uses Monte Carlo sampling to estimate returns
- Updates parameters in the direction of higher expected returns
- Belongs to the family of Policy Gradient methods
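The update described above can be sketched in PyTorch (a minimal illustration of the Monte Carlo policy-gradient loss, not this model's exact training loop):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo policy-gradient loss: -sum_t G_t * log pi(a_t|s_t)."""
    returns = []
    g = 0.0
    for r in reversed(rewards):      # discounted return G_t, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick in REINFORCE
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```

Minimizing this loss with gradient descent moves the policy parameters in the direction of higher expected return.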
## Something To Say
- 😤 Installed PLE (0.0.1) through trial and error, cloning via SSH key from https://github.com/ntasfi/PyGame-Learning-Environment
- 😭 Evaluated over 100 episodes; the score is still relatively low
- 💡 PixelCopter is wrapped with `gymnasium.spaces` in `Unit 4_2.py`
- 🙂 Continued training for 20k steps with `Unit 4_2_continue.py` after the initial 40k steps in `Unit 4_2.py`. Running time reference: 3h15min (40k steps)
- ☀️ Wish you a good time~~~
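Resuming training from a saved checkpoint (as `Unit 4_2_continue.py` presumably does; the checkpoint keys match the Usage section below, while the fresh-optimizer setup is an assumption) can be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    # Same architecture as in the Usage section
    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size * 2)
        self.fc3 = nn.Linear(h_size * 2, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)

def resume_training(checkpoint_path, lr=2e-5):
    """Rebuild the policy from a checkpoint and attach a fresh optimizer."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    policy = Policy(ckpt["s_size"], ckpt["a_size"], ckpt["hidden_size"])
    policy.load_state_dict(ckpt["policy_state_dict"])
    policy.train()
    # A fresh Adam with the continuation learning rate (assumption: the
    # optimizer state is not stored in this checkpoint layout)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    return policy, optimizer
```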
## Evaluation Results
| Metric | Value |
|---|---|
| Mean Reward | 31.85 |
| Std Reward | 26.03 |
| Min Reward | 2.00 |
| Max Reward | 118.00 |
| Mean Episode Length | 220.25 |
| Score (mean - std) | 5.82 |
| Evaluation Episodes | 100 |
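The Score row in the table is simply the mean reward minus its standard deviation (31.85 − 26.03 ≈ 5.82). A small sketch of that metric:

```python
import numpy as np

def eval_score(rewards):
    """Score used in the table above: mean episode reward minus its std."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards.mean() - rewards.std()
```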
## Usage
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
import numpy as np

class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size * 2)
        self.fc3 = nn.Linear(h_size * 2, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Rebuild the policy from the dimensions stored in the checkpoint
checkpoint = torch.load("reinforce_pixelcopter.pth", map_location=device)
policy = Policy(checkpoint["s_size"], checkpoint["a_size"], checkpoint["hidden_size"])
policy.load_state_dict(checkpoint["policy_state_dict"])
policy.eval()

# Requires the PLE Pixelcopter environment to be registered
# (see the PyGame-Learning-Environment link above)
env = gym.make("Pixelcopter-PLE-v0")
state, _ = env.reset()
for step in range(1000):
    state_tensor = torch.from_numpy(state).float().unsqueeze(0)
    with torch.no_grad():
        probs = policy(state_tensor)
    action = torch.argmax(probs, dim=1).item()  # greedy action for evaluation
    state, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        state, _ = env.reset()
env.close()
```
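The loop above acts greedily with `argmax`; during training, REINFORCE instead samples actions from the policy distribution and keeps their log-probabilities for the loss. A minimal sketch:

```python
import torch
from torch.distributions import Categorical

def sample_action(policy, state_tensor):
    """Sample an action from pi(a|s) and return it with its log-probability."""
    probs = policy(state_tensor)      # shape (1, a_size), rows sum to 1
    dist = Categorical(probs=probs)
    action = dist.sample()            # stochastic choice, unlike argmax
    return action.item(), dist.log_prob(action)
```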
## Training Configuration

- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 3-layer MLP (256 → 512 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.00003 (initial run), then 0.00002 (continued run)
- **Discount Factor (gamma)**: 0.99 (initial run), then 0.995 (continued run)
- **Training Episodes**: 40,000 + 20,000 (continued)
- **Max Steps per Episode**: 1000
- **Device**: cuda:0