# PPO Agent playing LunarLander-v2
This is a trained model of a PPO (Proximal Policy Optimization) agent playing LunarLander-v2 using Stable-Baselines3.
## Results
- Mean reward: 284.97 ± 14.71
- Training timesteps: 50,000,000
- Evaluation episodes: 10
The agent successfully learned to land the lunar module, achieving a mean reward well above the 200-point threshold at which LunarLander is considered solved.
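The reported score is the mean and standard deviation of the episode returns over the 10 evaluation episodes. A minimal sketch of that computation, using hypothetical per-episode returns (the actual per-episode values are not published on this card):

```python
import statistics

# Hypothetical per-episode returns from a 10-episode evaluation (illustrative only)
episode_returns = [290.1, 275.4, 301.2, 268.9, 288.0,
                   295.6, 270.3, 292.8, 281.5, 279.9]

mean_reward = statistics.mean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population std, as np.std uses by default

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```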
## Environment Details
- Environment: LunarLander-v2
- Observation Space: Box(8,) - position, velocity, angle, angular velocity, leg contact
- Action Space: Discrete(4) - do nothing, fire left engine, fire main engine, fire right engine
- Goal: Land the lunar module safely between the flags
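The observation layout and action meanings can be spelled out explicitly; the field names below follow the Gymnasium LunarLander documentation and are listed purely for reference:

```python
# Index layout of the Box(8,) observation (per the Gymnasium LunarLander docs)
OBS_FIELDS = [
    "x position", "y position",
    "x velocity", "y velocity",
    "angle", "angular velocity",
    "left leg contact", "right leg contact",
]

# The four discrete actions, in index order
ACTIONS = ["do nothing", "fire left engine", "fire main engine", "fire right engine"]

for index, name in enumerate(ACTIONS):
    print(index, name)
```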
## Model Details
- Algorithm: PPO (Proximal Policy Optimization)
- Policy: MlpPolicy (Multi-Layer Perceptron)
- Framework: Stable-Baselines3
- Device: CUDA (GPU acceleration)
## Training Hyperparameters
- Total timesteps: 50,000,000
- Number of environments: 16 (vectorized)
- Steps per update: 1,024
- Batch size: 64
- Number of epochs: 4
- Gamma (discount factor): 0.999
- GAE lambda: 0.98
- Entropy coefficient: 0.01
## Usage
```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.monitor import Monitor
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub (load_from_hub returns a local file path)
checkpoint = load_from_hub(
    repo_id="j-klawson/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Create environment (with Gymnasium, the render mode is set at creation time)
env = DummyVecEnv([lambda: Monitor(gym.make("LunarLander-v2", render_mode="human"))])

# Use the model
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones[0]:  # VecEnv returns arrays; check the single environment's flag
        break
```
## Training Process
The agent was trained using PPO with the following process:
- Environment vectorization - 16 parallel environments for efficient data collection
- Policy optimization - Neural network learns optimal actions through policy gradients
- GPU acceleration - CUDA-enabled training for faster convergence
- Extended training - 50M timesteps to ensure robust performance
## About
This model was trained as part of the Hugging Face Deep Reinforcement Learning Course - Unit 1.