PPO Agent for Lunar Lander

A Proximal Policy Optimization (PPO) agent trained on the LunarLander-v3 environment from Gymnasium, implemented from scratch using PyTorch.

Model Description

This model implements a PPO agent with separate actor and critic networks, using Generalized Advantage Estimation (GAE) for stable and efficient training. The agent learns to land a lunar module safely between two flags on the moon's surface.

Architecture

Actor Network (Policy):

  • Input: 8-dimensional state space (x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags)
  • Hidden Layers: 256 → 256 neurons with ReLU activation
  • Output: 4 action logits (do nothing, fire left engine, fire main engine, fire right engine)
  • Distribution: Categorical distribution for action sampling

Critic Network (Value Function):

  • Input: 8-dimensional state space
  • Hidden Layers: 256 → 256 neurons with ReLU activation
  • Output: Single value estimate
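
Only the actor is needed for inference (see Usage below), but a critic matching this description could look like the following minimal sketch; the class name and layout are illustrative and may differ from the repository's training code.

import torch.nn as nn

# Illustrative critic matching the architecture described above:
# 8-dimensional state in, two 256-unit ReLU hidden layers, one value out.
class CriticNetwork(nn.Module):
    def __init__(self, input_dims=8, hidden_layer1=256, hidden_layer2=256):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(input_dims, hidden_layer1),
            nn.ReLU(),
            nn.Linear(hidden_layer1, hidden_layer2),
            nn.ReLU(),
            nn.Linear(hidden_layer2, 1)  # single state-value estimate
        )

    def forward(self, state):
        return self.critic(state)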

Training Configuration

{
  "algorithm": "PPO",
  "environment": "LunarLander-v3",
  "total_episodes": 1000,
  "n_steps": 128,
  "batch_size": 16,
  "epochs": 4,
  "actor_hidden_size": 256,
  "critic_hidden_size": 256,
  "epsilon": 0.2,
  "gae_lambda": 0.95,
  "gamma": 0.99,
  "actor_learning_rate": 0.0003,
  "critic_learning_rate": 0.001,
  "gradient_clip": 0.5,
  "optimizer": "Adam"
}
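
As a rough sketch of how the optimizer-related entries translate to PyTorch (the stand-in networks below are illustrative, not the repository's exact code):

import torch
import torch.nn as nn

# Stand-in networks so the snippet runs on its own; in practice these would
# be the actor and critic described in the Architecture section.
actor = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 4))
critic = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

# Separate Adam optimizers with the configured learning rates.
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)

# During an update, gradients would be clipped to the configured 0.5 before
# optimizer.step(); norm clipping is one common reading of "gradient_clip".
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=0.5)
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=0.5)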

Training Results

  • Total Episodes Trained: 1000
  • Last 100 Episodes Average Reward: 168.81
  • Best Reward (Last 100): 304.53
  • Average Episode Length (Last 100): 349.39 steps

Training Statistics

The visualization shows the training performance over the last 100 episodes, including raw and smoothed rewards and episode durations.

Key Features

  • Generalized Advantage Estimation (GAE): λ=0.95 for the bias-variance tradeoff (see the sketch after this list)
  • Clipped Surrogate Objective: ε=0.2 to prevent large policy updates
  • Value Function Clipping: Stabilizes critic training
  • Gradient Clipping: Clips gradients to 0.5 for stability
  • Mini-batch Updates: 16 samples per batch with 4 epochs per update
  • Separate Optimizers: Independent learning rates for actor and critic
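
As an illustration of how the GAE advantages could be computed with these settings (a minimal sketch, not the repository's exact code):

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Illustrative GAE(lambda) over one rollout of n_steps transitions."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    next_value = last_value  # bootstrap value of the state after the rollout
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - float(dones[t])
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns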

Usage

Installation

pip install "gymnasium[box2d]" torch numpy matplotlib huggingface_hub

Loading and Using the Model

import torch
import torch.nn as nn
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the actor network architecture (must match training)
class ActorNetwork(nn.Module):
    def __init__(self, input_dims=8, output_dims=4, hidden_layer1=256, hidden_layer2=256):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(input_dims, hidden_layer1),
            nn.ReLU(),
            nn.Linear(hidden_layer1, hidden_layer2),
            nn.ReLU(),
            nn.Linear(hidden_layer2, output_dims)
        )
    
    def forward(self, state):
        dist = torch.distributions.Categorical(logits=self.actor(state))
        return dist

# Download actor model
actor_path = hf_hub_download(
    repo_id="ketencrypt10n/ppo-lunar-lander", 
    filename="actor_network.pth"
)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
actor = ActorNetwork(input_dims=8, output_dims=4).to(device)
actor.actor.load_state_dict(torch.load(actor_path, map_location=device))
actor.eval()

# Test the agent
env = gym.make("LunarLander-v3", render_mode="human")

for episode in range(30):
    done = False
    total_reward = 0
    state, info = env.reset()
    
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32, device=device)
        with torch.no_grad():
            dist = actor(state_tensor)
            action = dist.sample()
        
        state, reward, terminated, truncated, info = env.step(action.item())
        total_reward += reward
        done = terminated or truncated
    
    print(f"Episode: {episode+1}, Total Reward: {total_reward:.2f}")

env.close()
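
The loop above samples actions from the policy distribution. For a deterministic evaluation you could instead take the greedy action inside the loop, for example:

# Greedy (deterministic) action selection, as a drop-in replacement for sampling
with torch.no_grad():
    dist = actor(state_tensor)
    action = dist.probs.argmax()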

Viewing Training Statistics

The repository includes statistics_ppo.png, which shows the complete training visualization with:

  • Episode rewards (raw and smoothed)
  • Episode durations (raw and smoothed)

You can also load the data files for custom analysis:

import numpy as np
from huggingface_hub import hf_hub_download

# Download training history (last 100 episodes)
reward_path = hf_hub_download(repo_id="ketencrypt10n/ppo-lunar-lander", filename="reward_history_last100.npy")
duration_path = hf_hub_download(repo_id="ketencrypt10n/ppo-lunar-lander", filename="episode_durations_last100.npy")

rewards = np.load(reward_path)
durations = np.load(duration_path)

print(f"Average reward (last 100 episodes): {np.mean(rewards):.2f}")
print(f"Max reward (last 100 episodes): {np.max(rewards):.2f}")
print(f"Average duration (last 100 episodes): {np.mean(durations):.2f} steps")
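
For example, the downloaded arrays can be plotted with matplotlib (continuing the snippet above):

import matplotlib.pyplot as plt

# Quick look at the last-100-episode history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(rewards)
ax1.set_title("Episode rewards (last 100)")
ax1.set_xlabel("Episode")
ax1.set_ylabel("Total reward")
ax2.plot(durations)
ax2.set_title("Episode durations (last 100)")
ax2.set_xlabel("Episode")
ax2.set_ylabel("Steps")
plt.tight_layout()
plt.show()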

Implementation Details

This implementation was built from scratch using:

  • PyTorch for neural network implementation
  • NumPy for numerical operations
  • Gymnasium for the environment
  • No high-level RL libraries (no Stable Baselines3, RLlib, etc.)

Key Implementation Choices

  1. GAE for Advantage Estimation: Uses λ=0.95 to balance bias and variance in advantage estimates
  2. Clipped PPO Objective: Prevents destructively large policy updates with ε=0.2 clipping
  3. Separate Actor-Critic Networks: Independent networks with different learning rates
  4. Mini-batch Training: Updates the policy using 16-sample batches over 4 epochs (see the sketch after this list)
  5. Gradient Clipping: Clips gradients to 0.5 for both networks
  6. Learning Every N Steps: Updates after every 128 steps for better sample efficiency
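
A minimal sketch of the mini-batch scheme (illustrative; the repository may index its rollout buffer differently):

import numpy as np

def minibatch_indices(n_steps=128, batch_size=16, epochs=4, seed=None):
    """Yield shuffled 16-sample index batches, re-shuffling for each of 4 epochs."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n_steps)
        for start in range(0, n_steps, batch_size):
            yield order[start:start + batch_size]

# Each batch of indices would select states, actions, old log-probabilities and
# advantages from the 128-step rollout for one PPO update step.
for batch in minibatch_indices():
    pass  # compute the clipped loss on this batch and take an optimizer step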

PPO Algorithm

The PPO algorithm optimizes the clipped surrogate objective:

L^CLIP(θ) = E[min(r_t(θ)Â_t, clip(r_t(θ), 1-ε, 1+ε)Â_t)]

where:

  • r_t(θ) is the probability ratio between new and old policies
  • Â_t is the generalized advantage estimate
  • ε is the clipping parameter (0.2)
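
Written as a loss to minimize, this objective looks roughly like the sketch below (tensor names are illustrative; the repository's implementation may differ in details):

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Illustrative clipped PPO policy loss (negated objective)."""
    # Probability ratio r_t(theta) between the new and old policies
    ratios = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (minimum) term and negate for gradient descent
    return -torch.min(unclipped, clipped).mean()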

Environment

The Lunar Lander environment challenges the agent to:

  • Land between two yellow flags
  • Minimize fuel consumption
  • Avoid crashes
  • Control position and velocity

Reward Structure:

  • Moving closer to / farther from the landing pad: positive / negative reward
  • Moving slower / faster: positive / negative reward
  • Legs touching ground: +10 each
  • Side engine use: -0.03 per frame
  • Main engine use: -0.3 per frame
  • Crash: -100
  • Safe landing: +100

Citation

@misc{ppo_lunar_lander,
  author = {ketencrypt10n},
  title = {PPO Agent for Lunar Lander},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ketencrypt10n/ppo-lunar-lander}}
}

License

This model is released for educational and research purposes.
