PPO Agent playing LunarLander-v3

This is a trained PPO agent that plays LunarLander-v3, built with the stable-baselines3 library as part of the Hugging Face Deep Reinforcement Learning Course.

The agent achieves a mean reward of 254.57 ± 20.32 over 10 evaluation episodes, which clears the 200-point threshold commonly used as the "solved" criterion for this environment.

The Environment

LunarLander is a classic control task in which the agent must land a spacecraft on a designated pad between two flags. The observation is an 8-dimensional continuous vector containing position, velocity, angle, angular velocity, and left/right leg ground-contact flags. The action space is discrete with four options: do nothing, fire the left orientation engine, fire the main engine, or fire the right orientation engine. Firing engines costs fuel (negative reward), crashing is heavily penalized, and a soft landing on the pad is strongly rewarded.

Usage with Stable-Baselines3

```python
import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

repo_id = "NiseRoj/ppo-LunarLander-v3"
filename = "ppo-LunarLander-v3.zip"

# Download the trained model weights from the Hub
checkpoint = load_from_hub(repo_id=repo_id, filename=filename)
model = PPO.load(checkpoint)

# Roll out the policy in a rendered environment
env = gym.make("LunarLander-v3", render_mode="human")
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

Evaluating the Agent

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub(
    repo_id="NiseRoj/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

eval_env = Monitor(gym.make("LunarLander-v3"))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

Training Details

| Setting | Value |
| --- | --- |
| Algorithm | PPO |
| Policy | MlpPolicy |
| Environment | LunarLander-v3 |
| Parallel envs | 16 |
| Total timesteps | 1,000,000 |
| n_steps | 1024 |
| batch_size | 64 |
| n_epochs | 4 |
| gamma | 0.999 |
| gae_lambda | 0.98 |
| ent_coef | 0.01 |

(Replace any of the rows above with the actual values you used if they differ. package_to_hub does not write these for you, so the record is only as accurate as you make it.)

Results

| Metric | Value |
| --- | --- |
| Mean reward | 254.57 |
| Std reward | 20.32 |
| Eval episodes | 10 |
| Solved (≥200) | Yes |

Framework Versions

This model was trained and exported with:

- stable-baselines3
- gymnasium (with the box2d extra for LunarLander)
- huggingface_sb3

See requirements.txt in this repository if you need the exact pinned versions.
