# Reinforce Agent – Pixelcopter-PLE-v0
A policy gradient agent trained from scratch with the REINFORCE algorithm to play Pixelcopter, a challenging game with a continuous observation space and discrete actions, built on the PyGame Learning Environment (PLE).
## Performance
| Metric | Value |
|---|---|
| Mean Reward | 58.13 |
| Std of Reward | ±55.17 |
| Best Average Score | 80.65 (Episode 46000) |
| Evaluation Episodes | 10 |
| Training Episodes | 50,000 |
## Algorithm – REINFORCE (Monte Carlo Policy Gradient)
REINFORCE is a classic policy gradient method that directly optimizes the policy by:
- Rolling out full episodes using the current policy
- Computing discounted returns G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... for each timestep
- Updating the policy by gradient ascent on E[ log π_θ(a_t|s_t) · G_t ]
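The return computation above can be sketched in plain Python (the function name here is illustrative, not taken from the original code): a single backward pass over the episode's rewards accumulates the discounted sum.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + ... via one backward pass."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g          # G_t = r + gamma * G_{t+1}
        returns.insert(0, g)       # prepend so returns[t] matches timestep t
    return returns

# Example: three timesteps of +1 reward with gamma = 0.5
print(discounted_returns([1, 1, 1], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward recursion avoids recomputing each discounted sum from scratch, turning an O(T²) computation into O(T).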
The policy network is a simple feedforward neural network:
- Input: State observation vector
- Hidden layer: Fully connected + ReLU activation
- Output: Action probabilities via Softmax
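A minimal PyTorch sketch of such a network, with layer sizes taken from the hyperparameter table (the `act` helper matches the usage example later in this card, but its exact signature is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, s_size=7, a_size=2, h_size=64):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)  # state -> hidden
        self.fc2 = nn.Linear(h_size, a_size)  # hidden -> action logits

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)  # action probabilities

    def act(self, state):
        """Sample an action and return it with its log-probability."""
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```

The stored log-probabilities are what the REINFORCE update multiplies by the returns G_t.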
## Hyperparameters
| Parameter | Value |
|---|---|
| Hidden layer size | 64 |
| Training episodes | 50,000 |
| Max steps per episode | 10,000 |
| Discount factor (γ) | 0.99 |
| Learning rate | 1e-4 |
| Optimizer | Adam |
## About the Environment
Pixelcopter-PLE-v0 is a side-scrolling game where the agent controls a helicopter and must navigate through gaps in walls without crashing.
- Observation space: 7 continuous values (player velocity, player y-position, wall positions, etc.)
- Action space: 2 discrete actions – throttle up or do nothing
- Reward: +1 for each timestep survived
- Episode ends: On collision with a wall or the ground/ceiling
## How to Use

```python
from ple.games.pixelcopter import Pixelcopter
from ple import PLE
import torch

# Load the trained policy (saved with torch.save)
model = torch.load("model.pt", map_location=torch.device("cpu"))
model.eval()

# Build the environment. The custom Gymnasium wrapper around PLE used
# during training is not shown here; instantiate it as `env` first.
env = ...

# Run inference
state, _ = env.reset()
action, _ = model.act(state)
```
## Training Details
- Framework: PyTorch
- Returns: Standardized per episode for training stability
- Environment API: PyGame Learning Environment (PLE) via custom Gymnasium wrapper
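The per-episode return standardization mentioned above is a common variance-reduction trick; a sketch using only the standard library (the epsilon guard against zero variance is an assumption):

```python
import statistics

def standardize(returns, eps=1e-8):
    """Shift returns to zero mean and scale to unit variance."""
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns)  # population std over the episode
    return [(g - mean) / (std + eps) for g in returns]

print(standardize([3.0, 1.0, 2.0]))
```

Standardizing keeps the scale of the policy gradient roughly constant across episodes of very different lengths, which stabilizes training with a fixed learning rate.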
## Author
Trained by nirmanpatel as part of the Hugging Face Deep Reinforcement Learning Course.