# 🚀 PPO Agent for LunarLander-v3

## 📖 Overview
This repository contains a trained Proximal Policy Optimization (PPO) agent for the LunarLander-v3 environment.
The agent was trained for 1,000,000 timesteps using 16 parallel environments to accelerate experience collection and stabilize training.
A gameplay replay video is included to demonstrate performance.
## 🧠 Algorithm
Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm that improves policy stability using a clipped objective function.
Key Characteristics:
- On-policy algorithm
- Uses clipped surrogate objective
- Stable and sample-efficient
- Works well with vectorized environments
- **Policy:** `MlpPolicy` (fully connected neural network)
- **Framework:** Stable-Baselines3
- **Environment API:** Gymnasium
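The clipped surrogate objective at the heart of PPO can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not Stable-Baselines3's internal code; `clip_eps=0.2` matches SB3's default `clip_range`:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective for a batch of samples.

    ratio     -- pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage -- advantage estimate A(s, a)
    clip_eps  -- clipping range (SB3 default clip_range: 0.2)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the element-wise minimum removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps], which stabilizes policy updates.
    return np.minimum(unclipped, clipped)

# A positive advantage with a ratio already above 1 + eps gains nothing more:
print(clipped_surrogate(np.array([1.5]), np.array([1.0])))  # [1.2]
```

PPO maximizes the mean of this quantity over each minibatch, which is what makes large, destabilizing policy jumps unlikely.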
## 🎮 Environment Description

### LunarLander-v3 (Discrete)
The agent must:
- Control main and side thrusters
- Adjust velocity and angle
- Land softly between two flags
- Minimize fuel usage
### Reward Structure
The environment rewards:
- Smooth landings
- Staying within the landing zone
- Efficient fuel usage
Penalizes:
- Crashes
- Flying out of bounds
- Excessive fuel use
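For reference, the discrete variant exposes four actions and an 8-dimensional observation vector, following the layout documented by Gymnasium. The names below are descriptive labels for this README, not API identifiers:

```python
# LunarLander-v3 interface (discrete variant), per the Gymnasium docs.
# Action space: Discrete(4)
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# Observation space: Box with 8 components
OBSERVATION = [
    "x position", "y position",
    "x velocity", "y velocity",
    "angle", "angular velocity",
    "left leg contact", "right leg contact",
]
```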
## ⚙️ Training Setup
| Parameter | Value |
|---|---|
| Total Timesteps | 1,000,000 |
| Parallel Environments | 16 |
| Device | CPU |
| Monitoring | Enabled |
| Environment Type | Discrete |
Using 16 parallel environments significantly speeds up training and reduces variance in updates.
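These settings determine how experience is batched into PPO updates. A quick back-of-the-envelope calculation, using the hyperparameter values listed in this README:

```python
# How the training configuration decomposes into PPO updates
# (values taken from the tables in this README).
n_envs = 16
n_steps = 1024          # rollout length per environment
batch_size = 64
n_epochs = 4
total_timesteps = 1_000_000

rollout = n_envs * n_steps                 # samples collected per update
minibatches = rollout // batch_size        # minibatches per epoch
grad_steps = minibatches * n_epochs        # gradient steps per update
updates = -(-total_timesteps // rollout)   # ceil division

print(rollout, minibatches, grad_steps, updates)  # 16384 256 1024 62
```

So each update trains on 16,384 fresh samples, and the 1M-timestep budget works out to roughly 62 policy updates.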
## 🔬 Hyperparameters
| Hyperparameter | Value | Explanation |
|---|---|---|
| `n_steps` | 1024 | Rollout size per environment |
| `batch_size` | 64 | Minibatch size |
| `n_epochs` | 4 | Gradient update passes |
| `gamma` | 0.999 | Long-term reward emphasis |
| `gae_lambda` | 0.98 | Bias-variance tradeoff |
| `ent_coef` | 0.01 | Encourages exploration |
| `policy` | `MlpPolicy` | Fully connected network |
### Why These Values?

- High `gamma` (0.999) → focus on long-term stability
- `gae_lambda` (0.98) → balanced advantage estimation
- `ent_coef` (0.01) → prevents premature convergence
- Multiple epochs (4) → improves sample efficiency
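`gamma` and `gae_lambda` enter training through Generalized Advantage Estimation (GAE). The backward recursion can be sketched as follows; this is a minimal illustration, not Stable-Baselines3's vectorized implementation:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.999, lam=0.98):
    """Generalized Advantage Estimation over one rollout.

    rewards    -- r_t for t = 0..T-1
    values     -- V(s_t) for t = 0..T-1
    last_value -- V(s_T), bootstrap value after the rollout ends
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    next_value = last_value
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1},
    # where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        next_value = values[t]
    return adv

# With lam=0 the estimate collapses to the one-step TD error:
adv = gae_advantages([1.0, 1.0], [0.5, 0.5], 0.5, gamma=1.0, lam=0.0)
```

Larger `lam` mixes in longer reward horizons (lower bias, higher variance); `lam=0.98` keeps most of that horizon while damping variance slightly.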
## 📊 Evaluation
The model was evaluated over 10 deterministic episodes using a monitored environment.
```python
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# `model` is the loaded PPO agent (see Usage)
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="rgb_array"))

mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```
## ▶️ Usage

### 1️⃣ Install Dependencies

```bash
pip install stable-baselines3 "gymnasium[box2d]" huggingface_sb3
```

Note: LunarLander requires the `box2d` extra of Gymnasium.
### 2️⃣ Load the Model

```python
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub
import gymnasium as gym

model_path = load_from_hub(
    repo_id="haitemR/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(model_path)

env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
## 📁 Repository Files

- `ppo-LunarLander-v3.zip` – trained model
- `replay.mp4` – gameplay demonstration
## 📈 Performance

The trained agent achieves a mean reward of 244.59 over 10 evaluation episodes, comfortably above the 200-point threshold at which LunarLander is considered solved, and consistently lands within the target zone.
Training with vectorized environments significantly improved convergence speed.
## 👤 Author
Haitem R.
## 🏷️ Tags

`reinforcement-learning` · `ppo` · `lunar-lander` · `stable-baselines3` · `gymnasium` · `deep-rl`
## 📄 License
MIT License
## 🏆 Evaluation Results

- Mean reward on LunarLander-v3 (self-reported): 244.59