PPO Agent for Lunar Lander

A Proximal Policy Optimization (PPO) agent trained on the LunarLander-v3 environment from Gymnasium, implemented from scratch using PyTorch.

Model Description

This model implements a PPO agent with separate actor and critic networks, using Generalized Advantage Estimation (GAE) for stable and efficient training. The agent learns to land a lunar module safely between two flags on the moon's surface.

Architecture

Actor Network (Policy):

  • Input: 8-dimensional state space (x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags)
  • Hidden Layers: 256 → 256 neurons with ReLU activation
  • Output: 4 action logits (do nothing, fire left engine, fire main engine, fire right engine)
  • Distribution: Categorical distribution for action sampling

Critic Network (Value Function):

  • Input: 8-dimensional state space
  • Hidden Layers: 256 → 256 neurons with ReLU activation
  • Output: Single value estimate
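
Only the actor is needed for inference (see Usage below), but a critic matching this description could look like the following minimal sketch; the class name and layout are illustrative and may differ from the repository's training code.

import torch.nn as nn

# Illustrative critic matching the architecture described above:
# 8-dimensional state in, two 256-unit ReLU hidden layers, one value out.
class CriticNetwork(nn.Module):
    def __init__(self, input_dims=8, hidden_layer1=256, hidden_layer2=256):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(input_dims, hidden_layer1),
            nn.ReLU(),
            nn.Linear(hidden_layer1, hidden_layer2),
            nn.ReLU(),
            nn.Linear(hidden_layer2, 1)  # single state-value estimate
        )

    def forward(self, state):
        return self.critic(state)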

Training Configuration

{
  "algorithm": "PPO",
  "environment": "LunarLander-v3",
  "total_episodes": 1000,
  "n_steps": 128,
  "batch_size": 16,
  "epochs": 4,
  "actor_hidden_size": 256,
  "critic_hidden_size": 256,
  "epsilon": 0.2,
  "gae_lambda": 0.95,
  "gamma": 0.99,
  "actor_learning_rate": 0.0003,
  "critic_learning_rate": 0.001,
  "gradient_clip": 0.5,
  "optimizer": "Adam"
}
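
As a rough sketch of how the optimizer-related entries translate to PyTorch (the stand-in networks below are illustrative, not the repository's exact code):

import torch
import torch.nn as nn

# Stand-in networks so the snippet runs on its own; in practice these would
# be the actor and critic described in the Architecture section.
actor = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 4))
critic = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

# Separate Adam optimizers with the configured learning rates.
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)

# During an update, gradients would be clipped to the configured 0.5 before
# optimizer.step(); norm clipping is one common reading of "gradient_clip".
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=0.5)
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=0.5)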

Training Results

  • Total Episodes Trained: 1000
  • Last 100 Episodes Average Reward: 168.81
  • Best Reward (Last 100): 304.53
  • Average Episode Length (Last 100): 349.39 steps

Training Statistics

The visualization shows the training performance over the last 100 episodes, including raw and smoothed rewards and episode durations.

Key Features

  • Generalized Advantage Estimation (GAE): λ=0.95 for the bias-variance tradeoff (see the sketch after this list)
  • Clipped Surrogate Objective: ε=0.2 to prevent large policy updates
  • Value Function Clipping: Stabilizes critic training
  • Gradient Clipping: Clips gradients to 0.5 for stability
  • Mini-batch Updates: 16 samples per batch with 4 epochs per update
  • Separate Optimizers: Independent learning rates for actor and critic
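
As an illustration of how the GAE advantages could be computed with these settings (a minimal sketch, not the repository's exact code):

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Illustrative GAE(lambda) over one rollout of n_steps transitions."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    next_value = last_value  # bootstrap value of the state after the rollout
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - float(dones[t])
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns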

Usage

Installation

pip install "gymnasium[box2d]" torch numpy matplotlib huggingface_hub

Loading and Using the Model

import torch
import torch.nn as nn
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the actor network architecture (must match training)
class ActorNetwork(nn.Module):
    def __init__(self, input_dims=8, output_dims=4, hidden_layer1=256, hidden_layer2=256):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(input_dims, hidden_layer1),
            nn.ReLU(),
            nn.Linear(hidden_layer1, hidden_layer2),
            nn.ReLU(),
            nn.Linear(hidden_layer2, output_dims)
        )
    
    def forward(self, state):
        dist = torch.distributions.Categorical(logits=self.actor(state))
        return dist

# Download actor model
actor_path = hf_hub_download(
    repo_id="ketencrypt10n/ppo-lunar-lander", 
    filename="actor_network.pth"
)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
actor = ActorNetwork(input_dims=8, output_dims=4).to(device)
actor.actor.load_state_dict(torch.load(actor_path, map_location=device))
actor.eval()

# Test the agent
env = gym.make("LunarLander-v3", render_mode="human")

for episode in range(30):
    done = False
    total_reward = 0
    state, info = env.reset()
    
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32, device=device)
        with torch.no_grad():
            dist = actor(state_tensor)
            action = dist.sample()
        
        state, reward, terminated, truncated, info = env.step(action.item())
        total_reward += reward
        done = terminated or truncated
    
    print(f"Episode: {episode+1}, Total Reward: {total_reward:.2f}")

env.close()
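
The loop above samples actions from the policy distribution. For a deterministic evaluation you could instead take the greedy action inside the loop, for example:

# Greedy (deterministic) action selection, as a drop-in replacement for sampling
with torch.no_grad():
    dist = actor(state_tensor)
    action = dist.probs.argmax()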

Viewing Training Statistics

The repository includes statistics_ppo.png, which shows the complete training visualization with:

  • Episode rewards (raw and smoothed)
  • Episode durations (raw and smoothed)

You can also load the data files for custom analysis:

import numpy as np
from huggingface_hub import hf_hub_download

# Download training history (last 100 episodes)
reward_path = hf_hub_download(repo_id="ketencrypt10n/ppo-lunar-lander", filename="reward_history_last100.npy")
duration_path = hf_hub_download(repo_id="ketencrypt10n/ppo-lunar-lander", filename="episode_durations_last100.npy")

rewards = np.load(reward_path)
durations = np.load(duration_path)

print(f"Average reward (last 100 episodes): {np.mean(rewards):.2f}")
print(f"Max reward (last 100 episodes): {np.max(rewards):.2f}")
print(f"Average duration (last 100 episodes): {np.mean(durations):.2f} steps")
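
For example, the downloaded arrays can be plotted with matplotlib (continuing the snippet above):

import matplotlib.pyplot as plt

# Quick look at the last-100-episode history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(rewards)
ax1.set_title("Episode rewards (last 100)")
ax1.set_xlabel("Episode")
ax1.set_ylabel("Total reward")
ax2.plot(durations)
ax2.set_title("Episode durations (last 100)")
ax2.set_xlabel("Episode")
ax2.set_ylabel("Steps")
plt.tight_layout()
plt.show()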

Implementation Details

This implementation was built from scratch using:

  • PyTorch for neural network implementation
  • NumPy for numerical operations
  • Gymnasium for the environment
  • No high-level RL libraries (no Stable Baselines3, RLlib, etc.)

Key Implementation Choices

  1. GAE for Advantage Estimation: Uses λ=0.95 to balance bias and variance in advantage estimates
  2. Clipped PPO Objective: Prevents destructively large policy updates with ε=0.2 clipping
  3. Separate Actor-Critic Networks: Independent networks with different learning rates
  4. Mini-batch Training: Updates the policy using 16-sample batches over 4 epochs (see the sketch after this list)
  5. Gradient Clipping: Clips gradients to 0.5 for both networks
  6. Learning Every N Steps: Updates after every 128 steps for better sample efficiency
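
A minimal sketch of the mini-batch scheme (illustrative; the repository may index its rollout buffer differently):

import numpy as np

def minibatch_indices(n_steps=128, batch_size=16, epochs=4, seed=None):
    """Yield shuffled 16-sample index batches, re-shuffling for each of 4 epochs."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n_steps)
        for start in range(0, n_steps, batch_size):
            yield order[start:start + batch_size]

# Each batch of indices would select states, actions, old log-probabilities and
# advantages from the 128-step rollout for one PPO update step.
for batch in minibatch_indices():
    pass  # compute the clipped loss on this batch and take an optimizer step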

PPO Algorithm

The PPO algorithm optimizes the clipped surrogate objective:

L^CLIP(θ) = E[min(r_t(θ)Â_t, clip(r_t(θ), 1-ε, 1+ε)Â_t)]

where:

  • r_t(θ) is the probability ratio between new and old policies
  • Â_t is the generalized advantage estimate
  • ε is the clipping parameter (0.2)
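
Written as a loss to minimize, this objective looks roughly like the sketch below (tensor names are illustrative; the repository's implementation may differ in details):

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Illustrative clipped PPO policy loss (negated objective)."""
    # Probability ratio r_t(theta) between the new and old policies
    ratios = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (minimum) term and negate for gradient descent
    return -torch.min(unclipped, clipped).mean()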

Environment

The Lunar Lander environment challenges the agent to:

  • Land between two yellow flags
  • Minimize fuel consumption
  • Avoid crashes
  • Control position and velocity

Reward Structure:

  • Moving closer to / farther from the landing pad: positive / negative reward
  • Moving slower / faster: positive / negative reward
  • Legs touching ground: +10 each
  • Side engine use: -0.03 per frame
  • Main engine use: -0.3 per frame
  • Crash: -100
  • Safe landing: +100

Citation

@misc{ppo_lunar_lander,
  author = {ketencrypt10n},
  title = {PPO Agent for Lunar Lander},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ketencrypt10n/ppo-lunar-lander}}
}

License

This model is released for educational and research purposes.
