---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Reinforce-PixelCopter
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
        - type: mean_reward
          value: 58.13 +/- 55.17
          name: mean_reward
          verified: false
---

# 🚁 Reinforce Agent – Pixelcopter-PLE-v0

A policy gradient agent trained from scratch using the **REINFORCE** algorithm to play [Pixelcopter](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html), a challenging side-scrolling game with a discrete action space, built on the PyGame Learning Environment (PLE).

---

## 📊 Performance

| Metric | Value |
|--------|-------|
| Mean Reward | 58.13 |
| Std of Reward | ±55.17 |
| Best Average Score | 80.65 (episode 46,000) |
| Evaluation Episodes | 10 |
| Training Episodes | 50,000 |

---

## 🧠 Algorithm – REINFORCE (Monte Carlo Policy Gradient)

REINFORCE is a classic **policy gradient** method that directly optimizes the policy by:
1. Rolling out full episodes using the current policy
2. Computing discounted returns **Gₜ = rₜ₊₁ + γrₜ₊₂ + γ²rₜ₊₃ + ...** for each timestep
3. Updating the policy by maximizing **E[ log π_θ(a|s) · Gₜ ]**
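As a sketch, the discounted returns in step 2 can be computed in a single backward pass over the episode's rewards (the function name here is an assumption, not taken from the training code):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + ... back-to-front."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
        returns.insert(0, g)
    return returns
```

Iterating in reverse keeps the computation linear in episode length instead of quadratic.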

The policy network is a simple feedforward neural network:
- **Input:** State observation vector
- **Hidden layer:** Fully connected + ReLU activation
- **Output:** Action probabilities via Softmax
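A minimal PyTorch sketch of such a network (class and method names are assumptions modeled on the course's reference implementation; sizes match the hyperparameters table):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    """Feedforward policy: state vector -> softmax over actions."""
    def __init__(self, s_size=7, a_size=2, h_size=64):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)

    def act(self, state):
        """Sample an action and return it with its log-probability."""
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```

The log-probability returned by `act` is what the REINFORCE loss multiplies by the return Gₜ.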

---

## ⚙️ Hyperparameters

| Parameter | Value |
|-----------|-------|
| Hidden layer size | 64 |
| Training episodes | 50,000 |
| Max steps per episode | 10,000 |
| Discount factor (γ) | 0.99 |
| Learning rate | 1e-4 |
| Optimizer | Adam |
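Wiring these values up might look like the following (the `Sequential` stand-in is illustrative only, not the actual training code):

```python
import torch
import torch.nn as nn

# Illustrative stand-in policy matching the sizes in the table:
# 7-dim state -> 64 hidden units -> 2 action probabilities
policy = nn.Sequential(
    nn.Linear(7, 64), nn.ReLU(),
    nn.Linear(64, 2), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
gamma = 0.99  # discount factor
```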

---

## 🎮 About the Environment

**Pixelcopter-PLE-v0** is a side-scrolling game where the agent controls a helicopter and must navigate through gaps in walls without crashing. 

- **Observation space:** 7 continuous values (player velocity, player y-position, wall positions, etc.)
- **Action space:** 2 discrete actions (throttle up or do nothing)
- **Reward:** +1 for each timestep survived
- **Episode ends:** On collision with a wall or the ground/ceiling

---

## 🚀 How to Use

```python
from ple.games.pixelcopter import Pixelcopter
from ple import PLE
import torch

# Load the trained policy (saved as a full PyTorch module)
model = torch.load("model.pt", map_location=torch.device("cpu"))
model.eval()

# `env` is the custom Gymnasium wrapper around PLE's Pixelcopter
# used during training (construction not shown here)
state, _ = env.reset()
done = False
while not done:
    action, _ = model.act(state)  # sample an action from the policy
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```

---

## 📚 Training Details

- **Framework:** PyTorch
- **Returns:** Standardized per episode for training stability
- **Environment API:** PyGame Learning Environment (PLE) via custom Gymnasium wrapper
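Per-episode return standardization (zero mean, unit variance) is a common variance-reduction trick for REINFORCE; a sketch of what it likely looks like here:

```python
import torch

def standardize(returns, eps=1e-8):
    """Normalize episode returns to zero mean and unit std.

    The small eps guards against division by zero on
    constant-return episodes.
    """
    t = torch.as_tensor(returns, dtype=torch.float32)
    return (t - t.mean()) / (t.std() + eps)
```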

---

## 👤 Author

Trained by **nirmanpatel** as part of the [Hugging Face Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/intro/README).