🚀 PPO Agent – LunarLander-v3



📌 Overview

This repository contains a trained Proximal Policy Optimization (PPO) agent for the LunarLander-v3 environment.

The agent was trained for 1,000,000 timesteps using 16 parallel environments to accelerate experience collection and stabilize training.

A gameplay replay video is included to demonstrate performance.


🧠 Algorithm

Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm that improves policy stability using a clipped objective function.

Key Characteristics:

  • On-policy algorithm
  • Uses a clipped surrogate objective to bound each policy update
  • Stable; reuses each rollout for several epochs, making it more sample-efficient than vanilla policy gradients
  • Works well with vectorized environments
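The clipped surrogate objective can be sketched in a few lines of plain Python. This is a pedagogical sketch, not Stable-Baselines3's internal implementation; `clip_range=0.2` is PPO's common default, not a value confirmed for this training run:

```python
def ppo_clipped_objective(ratio, advantage, clip_range=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio     -- pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage -- estimated advantage A(s, a)
    """
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)

# Large policy shifts earn no extra credit: with a positive advantage,
# a ratio of 1.5 is treated the same as the clipped 1.2.
print(ppo_clipped_objective(1.5, 1.0))   # 1.2
print(ppo_clipped_objective(1.05, 1.0))  # 1.05
```

Taking the minimum makes the objective pessimistic: the policy gains nothing from moving far outside the clip region, which is what keeps updates stable.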

Policy Used: MlpPolicy (Fully Connected Neural Network)
Framework: Stable-Baselines3
Environment API: Gymnasium


🎮 Environment Description

LunarLander-v3 (Discrete)

The agent must:

  • Control main and side thrusters
  • Adjust velocity and angle
  • Land softly between two flags
  • Minimize fuel usage

Reward Structure

The environment rewards:

  • Smooth landings
  • Staying within the landing zone
  • Efficient fuel usage

Penalizes:

  • Crashes
  • Flying out of bounds
  • Excessive fuel use

โš™๏ธ Training Setup

| Parameter             | Value     |
|-----------------------|-----------|
| Total Timesteps       | 1,000,000 |
| Parallel Environments | 16        |
| Device                | CPU       |
| Monitoring            | Enabled   |
| Environment Type      | Discrete  |

Using 16 parallel environments significantly speeds up training and reduces variance in updates.


🔬 Hyperparameters

| Hyperparameter | Value     | Explanation                  |
|----------------|-----------|------------------------------|
| n_steps        | 1024      | Rollout size per environment |
| batch_size     | 64        | Minibatch size               |
| n_epochs       | 4         | Gradient update passes       |
| gamma          | 0.999     | Long-term reward emphasis    |
| gae_lambda     | 0.98      | Bias-variance tradeoff       |
| ent_coef       | 0.01      | Encourages exploration       |
| policy         | MlpPolicy | Fully connected network      |

Why These Values?

  • High gamma (0.999) → Focus on long-term stability
  • gae_lambda (0.98) → Balanced advantage estimation
  • ent_coef (0.01) → Prevents premature convergence
  • Multiple epochs (4) → Improves sample efficiency
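How gamma and gae_lambda interact can be seen in a plain-Python sketch of Generalized Advantage Estimation, the advantage estimator PPO uses (illustrative reward/value numbers, not data from the actual run; episode-termination masking is omitted for brevity):

```python
def gae_advantages(rewards, values, last_value, gamma=0.999, gae_lambda=0.98):
    """Generalized Advantage Estimation over one rollout.

    rewards    -- r_t for t = 0..T-1
    values     -- V(s_t) for t = 0..T-1
    last_value -- bootstrap value V(s_T) for the state after the rollout
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        next_adv = delta + gamma * gae_lambda * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# A single reward at the end of a 3-step rollout: each earlier step's
# advantage is scaled by gamma * gae_lambda, so with 0.999 * 0.98 the
# credit decays slowly and reaches far back along the trajectory.
adv = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], 0.0)
```

With a lower gae_lambda the decay factor shrinks and credit stays closer to the rewarding step, trading variance for bias.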

📊 Evaluation

The model was evaluated over 10 deterministic episodes using a monitored environment.

from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

eval_env = Monitor(gym.make("LunarLander-v3", render_mode="rgb_array"))

mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

โ–ถ๏ธ Usage

1๏ธโƒฃ Install Dependencies

pip install stable-baselines3 gymnasium huggingface_sb3

2๏ธโƒฃ Load the Model

from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub
import gymnasium as gym

model_path = load_from_hub(
    repo_id="haitemR/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip"
)

model = PPO.load(model_path)

env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()

for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

๐Ÿ“ Repository Files

  • ppo-LunarLander-v3.zip → Trained model
  • replay.mp4 → Gameplay demonstration

📈 Performance

The trained agent achieves stable positive rewards and consistently lands within the target zone.

Training with vectorized environments significantly improved convergence speed.


👤 Author

Haitem R.


๐Ÿท๏ธ Tags

reinforcement-learning
ppo
lunar-lander
stable-baselines3
gymnasium
deep-rl


📜 License

MIT License
