# PPO Agent playing LunarLander-v2
This is a trained model of a PPO agent playing LunarLander-v2, using a custom PyTorch implementation.
## Usage
```python
import torch
import torch.nn as nn
import gymnasium as gym

# Define the Actor network (must match the architecture used during training)
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, action_dim),
        )

    def forward(self, x):
        return self.network(x)

# Load the trained weights
checkpoint = torch.load("model.pt", map_location="cpu")
actor = Actor(state_dim=8, action_dim=4,
              hidden_size=checkpoint["config"]["hidden_size"])
actor.load_state_dict(checkpoint["actor_state_dict"])
actor.eval()

# Run one evaluation episode with a greedy (argmax) policy
env = gym.make("LunarLander-v2")
state, _ = env.reset()
total_reward = 0.0
for _ in range(1000):  # max steps per episode
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        logits = actor(state_tensor)
        action = torch.argmax(logits, dim=-1).item()
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
print(f"Total reward: {total_reward:.2f}")
```
## Training Results
- Mean reward: -9.92 ± 91.50
- Best reward: 143.33
- Success rate: 0.0% (episodes with reward > 200)
- Total episodes: 353
- Total timesteps: 100,000
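For reference, these figures can be reproduced from the logged per-episode returns. A minimal sketch, assuming `episode_rewards` is a hypothetical list holding the 353 episode returns collected during training:

```python
import numpy as np

rewards = np.asarray(episode_rewards)  # hypothetical: 353 per-episode returns
print(f"Mean reward: {rewards.mean():.2f} ± {rewards.std():.2f}")
print(f"Best reward: {rewards.max():.2f}")
# LunarLander is considered solved at an episode return above 200
print(f"Success rate: {(rewards > 200).mean() * 100:.1f}%")
```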
## Algorithm Configuration
- Algorithm: Proximal Policy Optimization (PPO)
- Learning rate: 0.0003
- Batch size: 2048
- Clip coefficient: 0.2
- Entropy coefficient: 0.01
- Value coefficient: 0.5
- Gamma: 0.99
- GAE Lambda: 0.95
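These hyperparameters correspond to the standard PPO clipped surrogate objective with Generalized Advantage Estimation. The sketch below is illustrative, not the exact training code; the function signatures and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    # Generalized Advantage Estimation over one rollout.
    # Assumes the rollout ends at a terminal state (zero bootstrap value).
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        last_gae = delta + gamma * gae_lambda * (1 - dones[t]) * last_gae
        advantages[t] = last_gae
    return advantages

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_coef=0.2, ent_coef=0.01, vf_coef=0.5):
    # Probability ratio between the current policy and the rollout policy
    ratio = (new_log_probs - old_log_probs).exp()
    # Clipped surrogate objective: take the pessimistic (minimum) bound
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Value-function loss and entropy bonus, weighted by the coefficients above
    value_loss = F.mse_loss(values, returns)
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```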
## Training Environment
- Environment: LunarLander-v2
- Framework: PyTorch + Gymnasium
- Training date: 2025-09-04
This model was trained as part of the Hugging Face Deep RL Course.