# REINFORCE Agent Playing Pixelcopter-PLE-v0

This is a trained model of a REINFORCE agent playing Pixelcopter-PLE-v0, built with PyTorch as part of the Deep Reinforcement Learning Course.
## Algorithm
REINFORCE is a policy gradient method that:
- Directly optimizes the policy π(a|s)
- Uses Monte Carlo sampling to estimate returns
- Updates parameters in the direction of higher expected returns
- Belongs to the family of Policy Gradient methods
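The update described above can be sketched in PyTorch (a minimal illustration of the Monte Carlo policy-gradient loss, not this model's exact training loop):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo policy-gradient loss: -sum_t G_t * log pi(a_t|s_t)."""
    returns = []
    g = 0.0
    for r in reversed(rewards):      # discounted return G_t, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick in REINFORCE
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```

Minimizing this loss with gradient descent moves the policy parameters in the direction of higher expected return.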
## Something To Say
- 😤 Installed PLE (0.0.1) through trial and error, cloning via SSH key from https://github.com/ntasfi/PyGame-Learning-Environment
- 😭 Evaluated over 100 episodes; the score is still relatively low
- 💡 PixelCopter is wrapped with `gymnasium.spaces` in `Unit 4_2.py`
- 🙂 Continued training for 20k steps with `Unit 4_2_continue.py` after the initial 40k steps in `Unit 4_2.py`. Running time reference: 3h15min (40k steps)
- ☀️ Wish you a good time~~~
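Resuming training from a saved checkpoint (as `Unit 4_2_continue.py` presumably does; the checkpoint keys match the Usage section below, while the fresh-optimizer setup is an assumption) can be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    # Same architecture as in the Usage section
    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size * 2)
        self.fc3 = nn.Linear(h_size * 2, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)

def resume_training(checkpoint_path, lr=2e-5):
    """Rebuild the policy from a checkpoint and attach a fresh optimizer."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    policy = Policy(ckpt["s_size"], ckpt["a_size"], ckpt["hidden_size"])
    policy.load_state_dict(ckpt["policy_state_dict"])
    policy.train()
    # A fresh Adam with the continuation learning rate (assumption: the
    # optimizer state is not stored in this checkpoint layout)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    return policy, optimizer
```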
## Evaluation Results
| Metric | Value |
|---|---|
| Mean Reward | 31.85 |
| Std Reward | 26.03 |
| Min Reward | 2.00 |
| Max Reward | 118.00 |
| Mean Episode Length | 220.25 |
| Score (mean - std) | 5.82 |
| Evaluation Episodes | 100 |
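The Score row in the table is simply the mean reward minus its standard deviation (31.85 − 26.03 ≈ 5.82). A small sketch of that metric:

```python
import numpy as np

def eval_score(rewards):
    """Score used in the table above: mean episode reward minus its std."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards.mean() - rewards.std()
```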
## Usage
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
import numpy as np

class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size * 2)
        self.fc3 = nn.Linear(h_size * 2, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Rebuild the policy from the dimensions stored in the checkpoint
checkpoint = torch.load("reinforce_pixelcopter.pth", map_location=device)
policy = Policy(checkpoint["s_size"], checkpoint["a_size"], checkpoint["hidden_size"])
policy.load_state_dict(checkpoint["policy_state_dict"])
policy.eval()

# Requires the PLE Pixelcopter environment to be registered
# (see the PyGame-Learning-Environment link above)
env = gym.make("Pixelcopter-PLE-v0")
state, _ = env.reset()
for step in range(1000):
    state_tensor = torch.from_numpy(state).float().unsqueeze(0)
    with torch.no_grad():
        probs = policy(state_tensor)
    action = torch.argmax(probs, dim=1).item()  # greedy action for evaluation
    state, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        state, _ = env.reset()
env.close()
```
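The loop above acts greedily with `argmax`; during training, REINFORCE instead samples actions from the policy distribution and keeps their log-probabilities for the loss. A minimal sketch:

```python
import torch
from torch.distributions import Categorical

def sample_action(policy, state_tensor):
    """Sample an action from pi(a|s) and return it with its log-probability."""
    probs = policy(state_tensor)      # shape (1, a_size), rows sum to 1
    dist = Categorical(probs=probs)
    action = dist.sample()            # stochastic choice, unlike argmax
    return action.item(), dist.log_prob(action)
```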
## Training Configuration

- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 3-layer MLP (256 → 512 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.00003 (initial run), then 0.00002 (continued run)
- **Discount Factor (gamma)**: 0.99 (initial run), then 0.995 (continued run)
- **Training Episodes**: 40,000 + 20,000 (continued)
- **Max Steps per Episode**: 1000
- **Device**: cuda:0