Proximal Policy Optimization (PPO) - LunarLander (From Scratch)

This is a PPO (Proximal Policy Optimization) agent trained from scratch on the LunarLander-v3 environment using PyTorch and Gymnasium.

Model Description

  • Algorithm: Proximal Policy Optimization (PPO)
  • Environment: LunarLander-v3
  • Implementation: From scratch using PyTorch
  • Training Episodes: 2,000
  • Average Reward: 261.80
  • Status: βœ… SOLVED (threshold: 200+)

Architecture

Actor Network (Policy)

Input (8) β†’ FC(64) β†’ Tanh β†’ FC(64) β†’ Tanh β†’ Output(4) β†’ Softmax
Total Parameters: 4,996

Critic Network (Value Function)

Input (8) β†’ FC(64) β†’ Tanh β†’ FC(64) β†’ Tanh β†’ Output(1)
Total Parameters: 4,801

Total Model Parameters: 9,797 (4,996 + 4,801, counting the weights and biases of the layers above)

  • State Space: 8-dimensional (x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags)
  • Action Space: 4 discrete actions (do nothing, fire left, fire main, fire right)
  • Architecture: Actor-Critic with separate networks
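
The Usage section below shows the actor in code; for completeness, here is a minimal PyTorch sketch of the critic matching the diagram above (the class name and layout are illustrative, not necessarily identical to the original training script):

import torch

# Critic matching the diagram above: Input(8) -> FC(64) -> Tanh -> FC(64) -> Tanh -> Output(1)
class Critic(torch.nn.Module):
    def __init__(self, state_dim=8, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, 1),  # scalar state value V(s)
        )

    def forward(self, x):
        return self.network(x)

# Sanity check against the parameter count above
critic = Critic()
print(sum(p.numel() for p in critic.parameters()))  # 4801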

Training Details

Hyperparameters

  • Learning Rate (Actor): 0.0003
  • Learning Rate (Critic): 0.001
  • Discount Factor (Ξ³): 0.99
  • GAE Lambda (Ξ»): 0.95
  • PPO Clip (Ξ΅): 0.2
  • Entropy Coefficient: 0.01
  • Value Loss Coefficient: 0.5
  • Batch Size: 64
  • Epochs per Update: 10
  • Steps per Update (rollout length): 2048
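
For reference, these settings could be gathered into a single config dict; the key names here are illustrative, not taken from the original training script:

# Illustrative hyperparameter config (key names are assumptions)
config = {
    "lr_actor": 3e-4,
    "lr_critic": 1e-3,
    "gamma": 0.99,          # discount factor
    "gae_lambda": 0.95,     # GAE smoothing
    "clip_eps": 0.2,        # PPO clipping range
    "entropy_coef": 0.01,
    "value_coef": 0.5,
    "batch_size": 64,
    "update_epochs": 10,
    "rollout_steps": 2048,  # env steps collected per policy update
}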

Key Techniques

  • Generalized Advantage Estimation (GAE): Trades off bias and variance in advantage estimates (see the sketch after this list)
  • Clipped Surrogate Objective: Prevents destructively large policy updates
  • Value Function Clipping: Stabilizes critic learning
  • Entropy Bonus: Encourages exploration
  • Multiple Epochs per Update: Reuses each collected rollout efficiently
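
To make the GAE step concrete, here is a minimal NumPy sketch of the standard recursion (array names and shapes are assumptions; the actual training code may differ):

import numpy as np

# Minimal GAE sketch. Assumed shapes: rewards, values, dones are 1-D float arrays
# of length T; last_value is V(s_T) bootstrapped from the critic.
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values
    # Advantages are often normalized per batch before the PPO update
    return advantages, returns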

Performance

The agent successfully solves LunarLander-v3:

  • Average Reward: 261.80 (solved at 200+)
  • Successfully lands between flags
  • Smooth descent with minimal fuel usage
  • Consistent performance across evaluations

Training Progress

The training reward curve is included in this repo as ppo_training.png (see Model Files below).

Usage

import torch
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(
    repo_id="aryannzzz/ppo-lunarlander-scratch",
    filename="ppo_lunarlander.pth"
)

# Define Actor Network
class Actor(torch.nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, action_dim),
            torch.nn.Softmax(dim=-1)
        )
    
    def forward(self, x):
        return self.network(x)

# Load model
checkpoint = torch.load(model_path, map_location='cpu')
actor = Actor(state_dim=8, action_dim=4)
actor.load_state_dict(checkpoint['actor_state_dict'])
actor.eval()

# Use the agent
env = gym.make('LunarLander-v3', render_mode='human')
state, _ = env.reset()

done = False
total_reward = 0

while not done:
    state_tensor = torch.FloatTensor(state).unsqueeze(0)
    with torch.no_grad():
        action_probs = actor(state_tensor)
    action = action_probs.argmax().item()  # Greedy
    
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Total Reward: {total_reward:.2f}")
env.close()
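
The loop above picks actions greedily via argmax, which works well for a converged policy; during training, PPO acts stochastically. A one-line change samples from the policy instead:

from torch.distributions import Categorical

# Sample an action from the policy distribution instead of taking the argmax
dist = Categorical(probs=action_probs)
action = dist.sample().item()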

Model Files

  • ppo_lunarlander.pth - Complete model checkpoint (actor + critic + optimizer states)
  • ppo_training.png - Training progress visualization

Training Code

This model was trained with a custom from-scratch PPO implementation in PyTorch (actor-critic with GAE and the clipped objective). The full training script is not bundled with the files listed above.
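
The core update follows the standard clipped surrogate objective with the value and entropy coefficients listed in the hyperparameters. A minimal sketch of one loss computation (function and variable names are illustrative, not the author's exact code):

import torch

# One PPO loss computation. Assumes advantages/returns were precomputed with GAE
# and old_log_probs were recorded when the rollout was collected.
def ppo_loss(actor, critic, states, actions, old_log_probs, advantages, returns,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    dist = torch.distributions.Categorical(probs=actor(states))
    new_log_probs = dist.log_prob(actions)  # actions: LongTensor of shape (N,)

    # Clipped surrogate objective: take the pessimistic (min) of the two surrogates
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value regression plus entropy bonus for exploration
    value_loss = torch.nn.functional.mse_loss(critic(states).squeeze(-1), returns)
    entropy = dist.entropy().mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy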

Benchmarks

| Metric | Value |
|--------|-------|
| Average Reward | 261.80 |
| Solved Threshold | 200+ |
| Training Episodes | 2,000 |
| Training Time | ~12 minutes (GPU) |
| Parameters | 9,797 |
| Algorithm | PPO (Actor-Critic) |

Why PPO?

PPO is one of the most popular RL algorithms because:

  • βœ… More stable than vanilla policy gradients
  • βœ… Sample-efficient: reuses each collected rollout for multiple optimization epochs
  • βœ… Works well on both discrete and continuous action spaces
  • βœ… Used in production (ChatGPT RLHF, OpenAI Five, etc.)

Limitations

  • Trained specifically for LunarLander-v3
  • Requires Box2D physics engine
  • May need hyperparameter tuning for other environments

Citation

If you use this model, please reference:

@misc{ppo-lunarlander-scratch,
  author = {aryannzzz},
  title = {PPO LunarLander Model from Scratch},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/aryannzzz/ppo-lunarlander-scratch}
}

Acknowledgments

Environment provided by Gymnasium (formerly OpenAI Gym). PPO algorithm originally introduced in:

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). 
Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.