PPO Agent playing LunarLander-v2

This is a trained model of a PPO (Proximal Policy Optimization) agent playing LunarLander-v2 using Stable-Baselines3.

Results

  • Mean reward: 284.97 ± 14.71
  • Training timesteps: 50,000,000
  • Evaluation episodes: 10
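The mean ± standard deviation figure above is the usual evaluation summary; it can be reproduced from per-episode returns as sketched below (the returns here are illustrative placeholders, not the actual evaluation data):

```python
import statistics

# Illustrative per-episode returns from 10 evaluation episodes (made-up numbers)
episode_returns = [290.1, 275.3, 301.2, 268.9, 284.0,
                   295.7, 279.4, 288.6, 272.8, 293.5]

mean_reward = statistics.mean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population std, as NumPy's default

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```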

The agent reliably lands the lunar module: its mean reward of 284.97 is well above the 200-point threshold at which the environment is considered solved!

Environment Details

  • Environment: LunarLander-v2
  • Observation Space: Box(8,) - x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags
  • Action Space: Discrete(4) - do nothing, fire left engine, fire main engine, fire right engine
  • Goal: Land the lunar module safely between the flags
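The 8 observation dimensions have a fixed layout. The pure-Python sketch below maps a raw observation vector to named fields (the field names follow the standard LunarLander layout; the sample values are made up):

```python
from typing import NamedTuple

class LunarObservation(NamedTuple):
    """Named view of the 8-dimensional LunarLander observation vector."""
    x: float                  # horizontal position (landing pad is at x = 0)
    y: float                  # vertical position
    vx: float                 # horizontal velocity
    vy: float                 # vertical velocity
    angle: float              # lander angle (radians)
    angular_velocity: float
    left_leg_contact: float   # 1.0 if the left leg touches the ground
    right_leg_contact: float  # 1.0 if the right leg touches the ground

# Example raw observation as returned by env.reset()/env.step() (made-up values)
raw = [0.01, 1.40, 0.05, -0.60, 0.02, 0.001, 0.0, 0.0]
obs = LunarObservation(*raw)
print(obs.y, obs.left_leg_contact)
```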

Model Details

  • Algorithm: PPO (Proximal Policy Optimization)
  • Policy: MlpPolicy (Multi-Layer Perceptron)
  • Framework: Stable-Baselines3
  • Device: CUDA (GPU acceleration)

Training Hyperparameters

  • Total timesteps: 50,000,000
  • Number of environments: 16 (vectorized)
  • Steps per update: 1,024
  • Batch size: 64
  • Number of epochs: 4
  • Gamma (discount factor): 0.999
  • GAE lambda: 0.98
  • Entropy coefficient: 0.01
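Gamma and GAE lambda control how advantages are estimated from collected rollouts. A minimal sketch of Generalized Advantage Estimation with the values above (the rewards and value estimates are made-up placeholders for a non-terminating rollout segment):

```python
GAMMA = 0.999
GAE_LAMBDA = 0.98

def compute_gae(rewards, values, last_value, gamma=GAMMA, lam=GAE_LAMBDA):
    """Compute GAE advantages for one (non-terminating) rollout segment."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD error: one-step bootstrapped return minus the value baseline
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages

# Toy rollout: two steps with made-up rewards and value estimates
adv = compute_gae(rewards=[1.0, 1.0], values=[0.5, 0.5], last_value=0.0)
```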

Usage

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.monitor import Monitor
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub, then load it into a PPO model
# (load_from_hub returns the local file path, not the model itself)
checkpoint = load_from_hub(
    repo_id="j-klawson/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Create a vectorized, monitored environment
# (render_mode is required in Gymnasium for env.render() to work)
env = DummyVecEnv([lambda: Monitor(gym.make("LunarLander-v2", render_mode="human"))])

# Run the model for one episode
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones[0]:  # VecEnv returns an array of done flags, one per environment
        break

Training Process

The agent was trained using PPO with the following process:

  1. Environment vectorization - 16 parallel environments for efficient data collection
  2. Policy optimization - Neural network learns optimal actions through policy gradients
  3. GPU acceleration - CUDA-enabled training for faster convergence
  4. Extended training - 50M timesteps to ensure robust performance
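The settings above imply a straightforward data-collection budget. The arithmetic below spells it out (pure bookkeeping, no Stable-Baselines3 required; SB3 rounds the final rollout up, so the actual update count can be one higher):

```python
NUM_ENVS = 16
N_STEPS = 1024          # steps per environment per update
BATCH_SIZE = 64
N_EPOCHS = 4
TOTAL_TIMESTEPS = 50_000_000

rollout_size = NUM_ENVS * N_STEPS                        # transitions collected per update
num_updates = TOTAL_TIMESTEPS // rollout_size            # full policy updates over training
minibatches_per_update = (rollout_size // BATCH_SIZE) * N_EPOCHS

print(f"{rollout_size} transitions per update, ~{num_updates} updates, "
      f"{minibatches_per_update} gradient minibatches per update")
```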

About

This model was trained as part of the Hugging Face Deep Reinforcement Learning Course - Unit 1.
