Proximal Policy Optimization (PPO) - LunarLander (From Scratch)

This is a PPO (Proximal Policy Optimization) agent trained from scratch on the LunarLander-v3 environment using PyTorch and Gymnasium.

Model Description

  • Algorithm: Proximal Policy Optimization (PPO)
  • Environment: LunarLander-v3
  • Implementation: From scratch using PyTorch
  • Training Episodes: 2,000
  • Average Reward: 261.80
  • Status: βœ… SOLVED (threshold: 200+)

Architecture

Actor Network (Policy)

Input (8) β†’ FC(64) β†’ Tanh β†’ FC(64) β†’ Tanh β†’ Output(4) β†’ Softmax
Total Parameters: 4,996

Critic Network (Value Function)

Input (8) β†’ FC(64) β†’ Tanh β†’ FC(64) β†’ Tanh β†’ Output(1)
Total Parameters: 4,801

Total Model Parameters: 9,797 (4,996 + 4,801, counting the weights and biases of the layers above)

  • State Space: 8-dimensional (x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags)
  • Action Space: 4 discrete actions (do nothing, fire left, fire main, fire right)
  • Architecture: Actor-Critic with separate networks
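
The Usage section below shows the actor in code; for completeness, here is a minimal PyTorch sketch of the critic matching the diagram above (the class name and layout are illustrative, not necessarily identical to the original training script):

import torch

# Critic matching the diagram above: Input(8) -> FC(64) -> Tanh -> FC(64) -> Tanh -> Output(1)
class Critic(torch.nn.Module):
    def __init__(self, state_dim=8, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, 1),  # scalar state value V(s)
        )

    def forward(self, x):
        return self.network(x)

# Sanity check against the parameter count above
critic = Critic()
print(sum(p.numel() for p in critic.parameters()))  # 4801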

Training Details

Hyperparameters

  • Learning Rate (Actor): 0.0003
  • Learning Rate (Critic): 0.001
  • Discount Factor (Ξ³): 0.99
  • GAE Lambda (Ξ»): 0.95
  • PPO Clip (Ξ΅): 0.2
  • Entropy Coefficient: 0.01
  • Value Loss Coefficient: 0.5
  • Batch Size: 64
  • Epochs per Update: 10
  • Steps per Update (rollout length): 2048
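
For reference, these settings could be gathered into a single config dict; the key names here are illustrative, not taken from the original training script:

# Illustrative hyperparameter config (key names are assumptions)
config = {
    "lr_actor": 3e-4,
    "lr_critic": 1e-3,
    "gamma": 0.99,          # discount factor
    "gae_lambda": 0.95,     # GAE smoothing
    "clip_eps": 0.2,        # PPO clipping range
    "entropy_coef": 0.01,
    "value_coef": 0.5,
    "batch_size": 64,
    "update_epochs": 10,
    "rollout_steps": 2048,  # env steps collected per policy update
}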

Key Techniques

  • Generalized Advantage Estimation (GAE): Trades off bias and variance in advantage estimates (see the sketch after this list)
  • Clipped Surrogate Objective: Prevents destructively large policy updates
  • Value Function Clipping: Stabilizes critic learning
  • Entropy Bonus: Encourages exploration
  • Multiple Epochs per Update: Reuses each collected rollout efficiently
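
To make the GAE step concrete, here is a minimal NumPy sketch of the standard recursion (array names and shapes are assumptions; the actual training code may differ):

import numpy as np

# Minimal GAE sketch. Assumed shapes: rewards, values, dones are 1-D float arrays
# of length T; last_value is V(s_T) bootstrapped from the critic.
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values
    # Advantages are often normalized per batch before the PPO update
    return advantages, returns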

Performance

The agent successfully solves LunarLander-v3:

  • Average Reward: 261.80 (solved at 200+)
  • Successfully lands between flags
  • Smooth descent with minimal fuel usage
  • Consistent performance across evaluations

Training Progress

The training reward curve is included in this repo as ppo_training.png (see Model Files below).

Usage

import torch
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(
    repo_id="aryannzzz/ppo-lunarlander-scratch",
    filename="ppo_lunarlander.pth"
)

# Define Actor Network
class Actor(torch.nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, action_dim),
            torch.nn.Softmax(dim=-1)
        )
    
    def forward(self, x):
        return self.network(x)

# Load model
checkpoint = torch.load(model_path, map_location='cpu')
actor = Actor(state_dim=8, action_dim=4)
actor.load_state_dict(checkpoint['actor_state_dict'])
actor.eval()

# Use the agent
env = gym.make('LunarLander-v3', render_mode='human')
state, _ = env.reset()

done = False
total_reward = 0

while not done:
    state_tensor = torch.FloatTensor(state).unsqueeze(0)
    with torch.no_grad():
        action_probs = actor(state_tensor)
    action = action_probs.argmax().item()  # Greedy
    
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Total Reward: {total_reward:.2f}")
env.close()
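
The loop above picks actions greedily via argmax, which works well for a converged policy; during training, PPO acts stochastically. A one-line change samples from the policy instead:

from torch.distributions import Categorical

# Sample an action from the policy distribution instead of taking the argmax
dist = Categorical(probs=action_probs)
action = dist.sample().item()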

Model Files

  • ppo_lunarlander.pth - Complete model checkpoint (actor + critic + optimizer states)
  • ppo_training.png - Training progress visualization

Training Code

This model was trained with a custom from-scratch PPO implementation in PyTorch (actor-critic with GAE and the clipped objective). The full training script is not bundled with the files listed above.
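
The core update follows the standard clipped surrogate objective with the value and entropy coefficients listed in the hyperparameters. A minimal sketch of one loss computation (function and variable names are illustrative, not the author's exact code):

import torch

# One PPO loss computation. Assumes advantages/returns were precomputed with GAE
# and old_log_probs were recorded when the rollout was collected.
def ppo_loss(actor, critic, states, actions, old_log_probs, advantages, returns,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    dist = torch.distributions.Categorical(probs=actor(states))
    new_log_probs = dist.log_prob(actions)  # actions: LongTensor of shape (N,)

    # Clipped surrogate objective: take the pessimistic (min) of the two surrogates
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value regression plus entropy bonus for exploration
    value_loss = torch.nn.functional.mse_loss(critic(states).squeeze(-1), returns)
    entropy = dist.entropy().mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy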

Benchmarks

| Metric | Value |
|--------|-------|
| Average Reward | 261.80 |
| Solved Threshold | 200+ |
| Training Episodes | 2,000 |
| Training Time | ~12 minutes (GPU) |
| Parameters | 9,797 |
| Algorithm | PPO (Actor-Critic) |

Why PPO?

PPO is one of the most popular RL algorithms because:

  • βœ… More stable than vanilla policy gradients
  • βœ… Sample-efficient: reuses each collected rollout for multiple optimization epochs
  • βœ… Works well on both discrete and continuous action spaces
  • βœ… Used in production (ChatGPT RLHF, OpenAI Five, etc.)

Limitations

  • Trained specifically for LunarLander-v3
  • Requires Box2D physics engine
  • May need hyperparameter tuning for other environments

Citation

If you use this model, please reference:

@misc{ppo-lunarlander-scratch,
  author = {aryannzzz},
  title = {PPO LunarLander Model from Scratch},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/aryannzzz/ppo-lunarlander-scratch}
}

Acknowledgments

Environment provided by Gymnasium (formerly OpenAI Gym). PPO algorithm originally introduced in:

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). 
Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.