PPO Agent playing LunarLander-v3

This is a trained PPO agent that plays LunarLander-v3, built with the stable-baselines3 library as part of the Hugging Face Deep Reinforcement Learning Course.

The agent achieves a mean reward of 254.57 ± 20.32 over 10 evaluation episodes, which clears the 200-point threshold commonly used as the "solved" criterion for this environment.

The Environment

LunarLander is a classic control task in which the agent must land a spacecraft on a designated pad between two flags. The observation is an 8-dimensional continuous vector containing position, velocity, angle, angular velocity, and left/right leg ground-contact flags. The action space is discrete with four options: do nothing, fire the left orientation engine, fire the main engine, or fire the right orientation engine. Firing engines costs fuel (negative reward), crashing is heavily penalized, and a soft landing on the pad is strongly rewarded.

Usage with Stable-Baselines3

```python
import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

repo_id = "NiseRoj/ppo-LunarLander-v3"
filename = "ppo-LunarLander-v3.zip"

# Download the trained model weights from the Hub
checkpoint = load_from_hub(repo_id=repo_id, filename=filename)
model = PPO.load(checkpoint)

# Roll out the policy in a rendered environment
env = gym.make("LunarLander-v3", render_mode="human")
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

Evaluating the Agent

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub(
    repo_id="NiseRoj/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

eval_env = Monitor(gym.make("LunarLander-v3"))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

Training Details

| Setting | Value |
| --- | --- |
| Algorithm | PPO |
| Policy | MlpPolicy |
| Environment | LunarLander-v3 |
| Parallel envs | 16 |
| Total timesteps | 1,000,000 |
| n_steps | 1024 |
| batch_size | 64 |
| n_epochs | 4 |
| gamma | 0.999 |
| gae_lambda | 0.98 |
| ent_coef | 0.01 |

(Replace any of the rows above with the actual values you used if they differ. package_to_hub does not write these for you, so the record is only as accurate as you make it.)

Results

| Metric | Value |
| --- | --- |
| Mean reward | 254.57 |
| Std reward | 20.32 |
| Eval episodes | 10 |
| Solved (≥200) | Yes |

Framework Versions

This model was trained and exported with:

- stable-baselines3
- gymnasium (with the box2d extra for LunarLander)
- huggingface_sb3

See requirements.txt in this repository if you need the exact pinned versions.
