# PPO Agent for Lunar Lander
A Proximal Policy Optimization (PPO) agent trained on the LunarLander-v3 environment from Gymnasium, implemented from scratch using PyTorch.
## Model Description
This model implements a PPO agent with separate actor and critic networks, using Generalized Advantage Estimation (GAE) for stable and efficient training. The agent learns to land a lunar module safely between two flags on the moon's surface.
### Architecture
**Actor Network (Policy):**
- Input: 8-dimensional state space (position, velocity, angle, angular velocity, leg contact)
- Hidden Layers: 256 → 256 neurons with ReLU activation
- Output: 4 action logits (do nothing, fire left engine, fire main engine, fire right engine)
- Distribution: Categorical distribution for action sampling
**Critic Network (Value Function):**
- Input: 8-dimensional state space
- Hidden Layers: 256 → 256 neurons with ReLU activation
- Output: Single value estimate
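For illustration, the critic described above can be sketched in PyTorch as follows. This is a minimal sketch based on the architecture listed in this card; the class and attribute names are hypothetical and would need to match the trained checkpoint exactly to load its weights.

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    """Value network: maps an 8-dim state to a single scalar value estimate."""
    def __init__(self, input_dims=8, hidden_layer1=256, hidden_layer2=256):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(input_dims, hidden_layer1),
            nn.ReLU(),
            nn.Linear(hidden_layer1, hidden_layer2),
            nn.ReLU(),
            nn.Linear(hidden_layer2, 1)  # single value output
        )

    def forward(self, state):
        return self.critic(state)

# Quick shape check on a dummy batch of 5 states
critic = CriticNetwork()
values = critic(torch.zeros(5, 8))
print(values.shape)  # torch.Size([5, 1])
```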
## Training Configuration
```json
{
  "algorithm": "PPO",
  "environment": "LunarLander-v3",
  "total_episodes": 1000,
  "n_steps": 128,
  "batch_size": 16,
  "epochs": 4,
  "actor_hidden_size": 256,
  "critic_hidden_size": 256,
  "epsilon": 0.2,
  "gae_lambda": 0.95,
  "gamma": 0.99,
  "actor_learning_rate": 0.0003,
  "critic_learning_rate": 0.001,
  "gradient_clip": 0.5,
  "optimizer": "Adam"
}
```
## Training Results
- Total Episodes Trained: 1000
- Last 100 Episodes Average Reward: 168.81
- Best Reward (Last 100): 304.53
- Average Episode Length (Last 100): 349.39 steps
The visualization shows the training performance over the last 100 episodes, including raw and smoothed rewards and episode durations.
## Key Features
- Generalized Advantage Estimation (GAE): λ=0.95 for bias-variance tradeoff
- Clipped Surrogate Objective: ε=0.2 to prevent large policy updates
- Value Function Clipping: Stabilizes critic training
- Gradient Clipping: Clips gradients to 0.5 for stability
- Mini-batch Updates: 16 samples per batch with 4 epochs per update
- Separate Optimizers: Independent learning rates for actor and critic
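As a standalone illustration of how GAE combines one-step TD errors with γ=0.99 and λ=0.95, the computation can be sketched like this (a generic sketch of the standard GAE recursion, not the repository's exact code; argument names are hypothetical):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: length-T arrays for the rollout.
    values: length T+1 (includes the bootstrap value of the final state).
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    # Sweep backwards, accumulating discounted TD errors
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages

# Tiny example: 3 steps, episode terminates on the last step
adv = compute_gae(
    rewards=np.array([1.0, 0.0, -1.0]),
    values=np.array([0.5, 0.4, 0.3, 0.0]),
    dones=np.array([0.0, 0.0, 1.0]),
)
print(adv.round(3))
```

Setting λ=0 recovers the one-step TD error (low variance, high bias), while λ=1 recovers the full Monte Carlo return minus the baseline (high variance, low bias); λ=0.95 sits near the commonly used sweet spot.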
## Usage
### Installation
```bash
pip install gymnasium torch numpy matplotlib
```
### Loading and Using the Model
```python
import torch
import torch.nn as nn
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the actor network architecture (must match training)
class ActorNetwork(nn.Module):
    def __init__(self, input_dims=8, output_dims=4, hidden_layer1=256, hidden_layer2=256):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(input_dims, hidden_layer1),
            nn.ReLU(),
            nn.Linear(hidden_layer1, hidden_layer2),
            nn.ReLU(),
            nn.Linear(hidden_layer2, output_dims)
        )

    def forward(self, state):
        dist = torch.distributions.Categorical(logits=self.actor(state))
        return dist

# Download the actor weights from the Hub
actor_path = hf_hub_download(
    repo_id="ketencrypt10n/ppo-lunar-lander",
    filename="actor_network.pth"
)

# Load the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
actor = ActorNetwork(input_dims=8, output_dims=4).to(device)
actor.actor.load_state_dict(torch.load(actor_path, map_location=device))
actor.eval()

# Test the agent
env = gym.make("LunarLander-v3", render_mode="human")
for episode in range(30):
    done = False
    total_reward = 0
    state, info = env.reset()
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32, device=device)
        with torch.no_grad():
            dist = actor(state_tensor)
            action = dist.sample()
        state, reward, terminated, truncated, info = env.step(action.item())
        total_reward += reward
        done = terminated or truncated
    print(f"Episode: {episode+1}, Total Reward: {total_reward:.2f}")
env.close()
```
### Viewing Training Statistics
The repository includes `statistics_ppo.png`, which shows the complete training visualization with:
- Episode rewards (raw and smoothed)
- Episode durations (raw and smoothed)
You can also load the data files for custom analysis:
```python
import numpy as np
from huggingface_hub import hf_hub_download

# Download training history (last 100 episodes)
reward_path = hf_hub_download(repo_id="ketencrypt10n/ppo-lunar-lander", filename="reward_history_last100.npy")
duration_path = hf_hub_download(repo_id="ketencrypt10n/ppo-lunar-lander", filename="episode_durations_last100.npy")

rewards = np.load(reward_path)
durations = np.load(duration_path)

print(f"Average reward (last 100 episodes): {np.mean(rewards):.2f}")
print(f"Max reward (last 100 episodes): {np.max(rewards):.2f}")
print(f"Average duration (last 100 episodes): {np.mean(durations):.2f} steps")
```
## Implementation Details
This implementation was built from scratch using:
- PyTorch for neural network implementation
- NumPy for numerical operations
- Gymnasium for the environment
- No high-level RL libraries (no Stable Baselines3, RLlib, etc.)
### Key Implementation Choices
- GAE for Advantage Estimation: Uses λ=0.95 to balance bias and variance in advantage estimates
- Clipped PPO Objective: Prevents destructively large policy updates with ε=0.2 clipping
- Separate Actor-Critic Networks: Independent networks with different learning rates
- Mini-batch Training: Updates policy using 16-sample batches over 4 epochs
- Gradient Clipping: Clips gradients to 0.5 for both networks
- Learning Every N Steps: Updates after every 128 steps for better sample efficiency
## PPO Algorithm
The PPO algorithm optimizes the clipped surrogate objective:

```
L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]
```

where:
- `r_t(θ)` is the probability ratio between the new and old policies
- `Â_t` is the generalized advantage estimate
- `ε` is the clipping parameter (0.2)
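The objective above translates almost directly into PyTorch. The sketch below is illustrative (tensor names are hypothetical, not the repository's code); note the sign flip, since optimizers minimize:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate objective (to be minimized)."""
    # r_t(theta): ratio of new to old action probabilities, via log-probs
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # min() keeps the pessimistic bound; mean() approximates E_t
    return -torch.min(unclipped, clipped).mean()

# Example: identical old/new policies give ratio 1, so loss = -mean(advantages)
adv = torch.tensor([1.0, -0.5])
logp = torch.tensor([-1.0, -2.0])
loss = ppo_clip_loss(logp, logp, adv)
print(loss.item())  # -0.25
```

Because the ratio is clamped to [1 − ε, 1 + ε], the gradient vanishes once an update would move the policy too far in the direction the advantage rewards, which is what prevents destructively large policy steps.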
## Environment
The Lunar Lander environment challenges the agent to:
- Land between two yellow flags
- Minimize fuel consumption
- Avoid crashes
- Control position and velocity
**Reward Structure:**
- Moving toward the landing pad increases reward; moving away decreases it
- Slowing down increases reward; speeding up decreases it
- Each leg in contact with the ground: +10
- Firing a side engine: -0.03 per frame
- Firing the main engine: -0.3 per frame
- Crashing: -100
- Landing safely: +100
## Citation
```bibtex
@misc{ppo_lunar_lander,
  author       = {ketencrypt10n},
  title        = {PPO Agent for Lunar Lander},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ketencrypt10n/ppo-lunar-lander}}
}
```
## License
This model is released for educational and research purposes.