# PPO Agent playing LunarLander-v2
This is a trained model of a PPO agent playing LunarLander-v2, using a custom PyTorch implementation.
## Usage
```python
import torch
import torch.nn as nn
import gymnasium as gym

# Define the Actor network (must match the architecture used during training)
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, action_dim),
        )

    def forward(self, x):
        return self.network(x)

# Load the trained weights
checkpoint = torch.load("model.pt", map_location="cpu")
actor = Actor(state_dim=8, action_dim=4,
              hidden_size=checkpoint["config"]["hidden_size"])
actor.load_state_dict(checkpoint["actor_state_dict"])
actor.eval()

# Run one evaluation episode with a greedy (argmax) policy
env = gym.make("LunarLander-v2")
state, _ = env.reset()
total_reward = 0.0
for _ in range(1000):  # max steps per episode
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        logits = actor(state_tensor)
        action = torch.argmax(logits, dim=-1).item()
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
print(f"Total reward: {total_reward:.2f}")
```
## Training Results
- Mean reward: -9.92 ± 91.50
- Best reward: 143.33
- Success rate: 0.0% (episodes with reward > 200)
- Total episodes: 353
- Total timesteps: 100,000
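For reference, these figures can be reproduced from the logged per-episode returns. A minimal sketch, assuming `episode_rewards` is a hypothetical list holding the 353 episode returns collected during training:

```python
import numpy as np

rewards = np.asarray(episode_rewards)  # hypothetical: 353 per-episode returns
print(f"Mean reward: {rewards.mean():.2f} ± {rewards.std():.2f}")
print(f"Best reward: {rewards.max():.2f}")
# LunarLander is considered solved at an episode return above 200
print(f"Success rate: {(rewards > 200).mean() * 100:.1f}%")
```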
## Algorithm Configuration
- Algorithm: Proximal Policy Optimization (PPO)
- Learning rate: 0.0003
- Batch size: 2048
- Clip coefficient: 0.2
- Entropy coefficient: 0.01
- Value coefficient: 0.5
- Gamma: 0.99
- GAE Lambda: 0.95
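These hyperparameters correspond to the standard PPO clipped surrogate objective with Generalized Advantage Estimation. The sketch below is illustrative, not the exact training code; the function signatures and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    # Generalized Advantage Estimation over one rollout.
    # Assumes the rollout ends at a terminal state (zero bootstrap value).
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        last_gae = delta + gamma * gae_lambda * (1 - dones[t]) * last_gae
        advantages[t] = last_gae
    return advantages

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_coef=0.2, ent_coef=0.01, vf_coef=0.5):
    # Probability ratio between the current policy and the rollout policy
    ratio = (new_log_probs - old_log_probs).exp()
    # Clipped surrogate objective: take the pessimistic (minimum) bound
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Value-function loss and entropy bonus, weighted by the coefficients above
    value_loss = F.mse_loss(values, returns)
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```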
## Training Environment
- Environment: LunarLander-v2
- Framework: PyTorch + Gymnasium
- Training date: 2025-09-04
This model was trained as part of the Hugging Face Deep RL Course.