PPO Agent playing LunarLander-v2

This is a trained model of a PPO (Proximal Policy Optimization) agent playing LunarLander-v2 using Stable-Baselines3.

Results

  • Mean reward: 284.97 ± 14.71
  • Training timesteps: 50,000,000
  • Evaluation episodes: 10
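The mean ± standard deviation figure above is the usual evaluation summary; it can be reproduced from per-episode returns as sketched below (the returns here are illustrative placeholders, not the actual evaluation data):

```python
import statistics

# Illustrative per-episode returns from 10 evaluation episodes (made-up numbers)
episode_returns = [290.1, 275.3, 301.2, 268.9, 284.0,
                   295.7, 279.4, 288.6, 272.8, 293.5]

mean_reward = statistics.mean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population std, as NumPy's default

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```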

The agent reliably lands the lunar module: its mean reward of 284.97 is well above the 200-point threshold at which the environment is considered solved!

Environment Details

  • Environment: LunarLander-v2
  • Observation Space: Box(8,) - x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags
  • Action Space: Discrete(4) - do nothing, fire left engine, fire main engine, fire right engine
  • Goal: Land the lunar module safely between the flags
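The 8 observation dimensions have a fixed layout. The pure-Python sketch below maps a raw observation vector to named fields (the field names follow the standard LunarLander layout; the sample values are made up):

```python
from typing import NamedTuple

class LunarObservation(NamedTuple):
    """Named view of the 8-dimensional LunarLander observation vector."""
    x: float                  # horizontal position (landing pad is at x = 0)
    y: float                  # vertical position
    vx: float                 # horizontal velocity
    vy: float                 # vertical velocity
    angle: float              # lander angle (radians)
    angular_velocity: float
    left_leg_contact: float   # 1.0 if the left leg touches the ground
    right_leg_contact: float  # 1.0 if the right leg touches the ground

# Example raw observation as returned by env.reset()/env.step() (made-up values)
raw = [0.01, 1.40, 0.05, -0.60, 0.02, 0.001, 0.0, 0.0]
obs = LunarObservation(*raw)
print(obs.y, obs.left_leg_contact)
```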

Model Details

  • Algorithm: PPO (Proximal Policy Optimization)
  • Policy: MlpPolicy (Multi-Layer Perceptron)
  • Framework: Stable-Baselines3
  • Device: CUDA (GPU acceleration)

Training Hyperparameters

  • Total timesteps: 50,000,000
  • Number of environments: 16 (vectorized)
  • Steps per update: 1,024
  • Batch size: 64
  • Number of epochs: 4
  • Gamma (discount factor): 0.999
  • GAE lambda: 0.98
  • Entropy coefficient: 0.01
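Gamma and GAE lambda control how advantages are estimated from collected rollouts. A minimal sketch of Generalized Advantage Estimation with the values above (the rewards and value estimates are made-up placeholders for a non-terminating rollout segment):

```python
GAMMA = 0.999
GAE_LAMBDA = 0.98

def compute_gae(rewards, values, last_value, gamma=GAMMA, lam=GAE_LAMBDA):
    """Compute GAE advantages for one (non-terminating) rollout segment."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD error: one-step bootstrapped return minus the value baseline
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages

# Toy rollout: two steps with made-up rewards and value estimates
adv = compute_gae(rewards=[1.0, 1.0], values=[0.5, 0.5], last_value=0.0)
```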

Usage

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.monitor import Monitor
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub, then load it into a PPO model
# (load_from_hub returns the local file path, not the model itself)
checkpoint = load_from_hub(
    repo_id="j-klawson/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Create a vectorized, monitored environment
# (render_mode is required in Gymnasium for env.render() to work)
env = DummyVecEnv([lambda: Monitor(gym.make("LunarLander-v2", render_mode="human"))])

# Run the model for one episode
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones[0]:  # VecEnv returns an array of done flags, one per environment
        break

Training Process

The agent was trained using PPO with the following process:

  1. Environment vectorization - 16 parallel environments for efficient data collection
  2. Policy optimization - Neural network learns optimal actions through policy gradients
  3. GPU acceleration - CUDA-enabled training for faster convergence
  4. Extended training - 50M timesteps to ensure robust performance
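The settings above imply a straightforward data-collection budget. The arithmetic below spells it out (pure bookkeeping, no Stable-Baselines3 required; SB3 rounds the final rollout up, so the actual update count can be one higher):

```python
NUM_ENVS = 16
N_STEPS = 1024          # steps per environment per update
BATCH_SIZE = 64
N_EPOCHS = 4
TOTAL_TIMESTEPS = 50_000_000

rollout_size = NUM_ENVS * N_STEPS                        # transitions collected per update
num_updates = TOTAL_TIMESTEPS // rollout_size            # full policy updates over training
minibatches_per_update = (rollout_size // BATCH_SIZE) * N_EPOCHS

print(f"{rollout_size} transitions per update, ~{num_updates} updates, "
      f"{minibatches_per_update} gradient minibatches per update")
```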

About

This model was trained as part of the Hugging Face Deep Reinforcement Learning Course - Unit 1.
