# Proximal Policy Optimization (PPO) - LunarLander (From Scratch)
This is a PPO (Proximal Policy Optimization) agent trained from scratch on the LunarLander-v3 environment using PyTorch and Gymnasium.
## Model Description

- Algorithm: Proximal Policy Optimization (PPO)
- Environment: LunarLander-v3
- Implementation: From scratch using PyTorch
- Training Episodes: 2,000
- Average Reward: 261.80
- Status: ✅ SOLVED (threshold: 200+)
## Architecture

### Actor Network (Policy)

Input (8) → FC(64) → Tanh → FC(64) → Tanh → Output(4) → Softmax

Total Parameters: ~4,500

### Critic Network (Value Function)

Input (8) → FC(64) → Tanh → FC(64) → Tanh → Output(1)

Total Parameters: ~4,700

Total Model Parameters: 9,221

- State Space: 8 continuous values (position, velocity, angle, angular velocity, leg contact)
- Action Space: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine)
- Architecture: Actor-Critic with separate networks
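For completeness, the critic described above can be sketched analogously to the actor. The `Critic` class below is an assumption reconstructed from the layer diagram, not code taken from the repository:

```python
import torch

class Critic(torch.nn.Module):
    """Value network: maps an 8-dim LunarLander state to a scalar value estimate."""
    def __init__(self, state_dim=8, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, 1),  # single value output, no activation
        )

    def forward(self, x):
        return self.network(x)

critic = Critic()
value = critic(torch.zeros(1, 8))
print(value.shape)  # torch.Size([1, 1])
```

Unlike the actor, the critic has no softmax: it regresses an unbounded state value.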
## Training Details

### Hyperparameters

- Learning Rate (Actor): 0.0003
- Learning Rate (Critic): 0.001
- Discount Factor (γ): 0.99
- GAE Lambda (λ): 0.95
- PPO Clip (ε): 0.2
- Entropy Coefficient: 0.01
- Value Loss Coefficient: 0.5
- Batch Size: 64
- Epochs per Update: 10
- Steps per Episode: 2048
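As a rough illustration of where these hyperparameters enter the update, here is a generic PPO clipped-objective loss. The function name and tensor arguments are hypothetical, not the repository's code; only the coefficient values come from the table above:

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    # Probability ratio between the current and the data-collecting policy
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective (negated, since optimizers minimize)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Value loss: MSE between critic predictions and computed returns
    value_loss = torch.nn.functional.mse_loss(values, returns)
    # Entropy bonus encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()

# Toy batch just to exercise the function
n = 4
loss = ppo_loss(torch.zeros(n), torch.zeros(n), torch.randn(n),
                torch.randn(n), torch.randn(n), torch.full((n,), 0.5))
print(loss.item())
```

In practice this loss would be minimized for 10 epochs over minibatches of 64 transitions, per the hyperparameters above.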
### Key Techniques

- Generalized Advantage Estimation (GAE): Reduces the variance of advantage estimates while controlling bias via λ
- Clipped Surrogate Objective: Prevents excessively large policy updates
- Value Function Clipping: Stabilizes critic learning
- Entropy Bonus: Encourages exploration
- Multiple Epochs: Reuses collected data efficiently
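The GAE step listed above can be sketched as follows. This is a generic, textbook-style implementation using the γ and λ values from the hyperparameter table; it is not necessarily identical to the training script:

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones: 1-D tensors of length T
    last_value: critic's bootstrap value for the state after the rollout
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]  # zero out bootstrapping at episode ends
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns

# Toy rollout: constant reward 1, zero value estimates, no terminations
adv, ret = compute_gae(torch.ones(3), torch.zeros(3), torch.zeros(3), 0.0)
print(adv)
```

Earlier timesteps accumulate discounted future TD errors, so their advantages are larger in this toy rollout.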
## Performance

The agent successfully solves LunarLander-v3:

- Average Reward: 261.80 (solved at 200+)
- Successfully lands between the flags
- Smooth descent with minimal fuel usage
- Consistent performance across evaluations
## Usage

```python
import torch
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(
    repo_id="aryannzzz/ppo-lunarlander-scratch",
    filename="ppo_lunarlander.pth"
)

# Define Actor Network
class Actor(torch.nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, action_dim),
            torch.nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)

# Load model
checkpoint = torch.load(model_path, map_location='cpu')
actor = Actor(state_dim=8, action_dim=4)
actor.load_state_dict(checkpoint['actor_state_dict'])
actor.eval()

# Use the agent
env = gym.make('LunarLander-v3', render_mode='human')
state, _ = env.reset()
done = False
total_reward = 0

while not done:
    state_tensor = torch.FloatTensor(state).unsqueeze(0)
    with torch.no_grad():
        action_probs = actor(state_tensor)
    action = action_probs.argmax().item()  # Greedy action selection
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Total Reward: {total_reward:.2f}")
env.close()
```
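The snippet above selects actions greedily via argmax. During training (and optionally at evaluation) the policy is instead sampled stochastically; a minimal sketch using an illustrative probability vector in place of real actor output:

```python
import torch

# Stochastic action selection: sample from the categorical policy distribution
# instead of taking the argmax. In practice `action_probs` would come from
# actor(state_tensor); the values below are purely illustrative.
action_probs = torch.tensor([[0.1, 0.2, 0.6, 0.1]])
dist = torch.distributions.Categorical(probs=action_probs)
action = dist.sample().item()
print(action)  # an integer in 0..3
```

Sampling preserves exploration and usually matches the distribution the agent was evaluated under; argmax gives deterministic, repeatable rollouts.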
## Model Files

- `ppo_lunarlander.pth` - Complete model checkpoint (actor + critic + optimizer states)
- `ppo_training.png` - Training progress visualization
## Training Code

This model was trained using a custom PPO implementation:

- Repository: RL-Competition-Starter
- File: `04_ppo_from_scratch.py`
## Benchmarks
| Metric | Value |
|---|---|
| Average Reward | 261.80 |
| Solved Threshold | 200+ |
| Training Episodes | 2,000 |
| Training Time | ~12 minutes (GPU) |
| Parameters | 9,221 |
| Algorithm | PPO (Actor-Critic) |
## Why PPO?

PPO is one of the most popular RL algorithms because it is:

- ✅ More stable than vanilla policy gradients
- ✅ Sample-efficient, thanks to multiple optimization epochs per batch of data
- ✅ Effective on both discrete and continuous action spaces
- ✅ Used in production (ChatGPT RLHF, OpenAI Five, etc.)
## Limitations

- Trained specifically for LunarLander-v3
- Requires the Box2D physics engine
- May need hyperparameter tuning for other environments
## Citation

If you use this model, please reference:

```bibtex
@misc{ppo-lunarlander-scratch,
  author = {aryannzzz},
  title = {PPO LunarLander Model from Scratch},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/aryannzzz/ppo-lunarlander-scratch}
}
```
## Acknowledgments

Environment provided by Gymnasium (formerly OpenAI Gym). The PPO algorithm was originally introduced in:

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
