# 🚀 PPO Agent for LunarLander-v3

## 📖 Overview
This repository contains a trained Proximal Policy Optimization (PPO) agent for the LunarLander-v3 environment.
The agent was trained for 1,000,000 timesteps using 16 parallel environments to accelerate experience collection and stabilize training.
A gameplay replay video is included to demonstrate performance.
## 🧠 Algorithm
Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm that improves policy stability using a clipped objective function.
Key Characteristics:
- On-policy algorithm
- Uses clipped surrogate objective
- Stable and sample-efficient
- Works well with vectorized environments
- **Policy:** `MlpPolicy` (fully connected neural network)
- **Framework:** Stable-Baselines3
- **Environment API:** Gymnasium
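The clipped surrogate objective at the heart of PPO can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not Stable-Baselines3's internal code; `clip_eps=0.2` matches SB3's default `clip_range`:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective for a batch of samples.

    ratio     -- pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage -- advantage estimate A(s, a)
    clip_eps  -- clipping range (SB3 default clip_range: 0.2)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the element-wise minimum removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps], which stabilizes policy updates.
    return np.minimum(unclipped, clipped)

# A positive advantage with a ratio already above 1 + eps gains nothing more:
print(clipped_surrogate(np.array([1.5]), np.array([1.0])))  # [1.2]
```

PPO maximizes the mean of this quantity over each minibatch, which is what makes large, destabilizing policy jumps unlikely.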
## 🎮 Environment Description

### LunarLander-v3 (Discrete)
The agent must:
- Control main and side thrusters
- Adjust velocity and angle
- Land softly between two flags
- Minimize fuel usage
### Reward Structure
The environment rewards:
- Smooth landings
- Staying within the landing zone
- Efficient fuel usage
Penalizes:
- Crashes
- Flying out of bounds
- Excessive fuel use
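For reference, the discrete variant exposes four actions and an 8-dimensional observation vector, following the layout documented by Gymnasium. The names below are descriptive labels for this README, not API identifiers:

```python
# LunarLander-v3 interface (discrete variant), per the Gymnasium docs.
# Action space: Discrete(4)
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# Observation space: Box with 8 components
OBSERVATION = [
    "x position", "y position",
    "x velocity", "y velocity",
    "angle", "angular velocity",
    "left leg contact", "right leg contact",
]
```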
## ⚙️ Training Setup
| Parameter | Value |
|---|---|
| Total Timesteps | 1,000,000 |
| Parallel Environments | 16 |
| Device | CPU |
| Monitoring | Enabled |
| Environment Type | Discrete |
Using 16 parallel environments significantly speeds up training and reduces variance in updates.
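These settings determine how experience is batched into PPO updates. A quick back-of-the-envelope calculation, using the hyperparameter values listed in this README:

```python
# How the training configuration decomposes into PPO updates
# (values taken from the tables in this README).
n_envs = 16
n_steps = 1024          # rollout length per environment
batch_size = 64
n_epochs = 4
total_timesteps = 1_000_000

rollout = n_envs * n_steps                 # samples collected per update
minibatches = rollout // batch_size        # minibatches per epoch
grad_steps = minibatches * n_epochs        # gradient steps per update
updates = -(-total_timesteps // rollout)   # ceil division

print(rollout, minibatches, grad_steps, updates)  # 16384 256 1024 62
```

So each update trains on 16,384 fresh samples, and the 1M-timestep budget works out to roughly 62 policy updates.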
## 🔬 Hyperparameters
| Hyperparameter | Value | Explanation |
|---|---|---|
| `n_steps` | 1024 | Rollout size per environment |
| `batch_size` | 64 | Minibatch size |
| `n_epochs` | 4 | Gradient update passes |
| `gamma` | 0.999 | Long-term reward emphasis |
| `gae_lambda` | 0.98 | Bias-variance tradeoff |
| `ent_coef` | 0.01 | Encourages exploration |
| `policy` | `MlpPolicy` | Fully connected network |
### Why These Values?

- High `gamma` (0.999) → focus on long-term stability
- `gae_lambda` (0.98) → balanced advantage estimation
- `ent_coef` (0.01) → prevents premature convergence
- Multiple epochs (4) → improves sample efficiency
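`gamma` and `gae_lambda` enter training through Generalized Advantage Estimation (GAE). The backward recursion can be sketched as follows; this is a minimal illustration, not Stable-Baselines3's vectorized implementation:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.999, lam=0.98):
    """Generalized Advantage Estimation over one rollout.

    rewards    -- r_t for t = 0..T-1
    values     -- V(s_t) for t = 0..T-1
    last_value -- V(s_T), bootstrap value after the rollout ends
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    next_value = last_value
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1},
    # where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        next_value = values[t]
    return adv

# With lam=0 the estimate collapses to the one-step TD error:
adv = gae_advantages([1.0, 1.0], [0.5, 0.5], 0.5, gamma=1.0, lam=0.0)
```

Larger `lam` mixes in longer reward horizons (lower bias, higher variance); `lam=0.98` keeps most of that horizon while damping variance slightly.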
## 📊 Evaluation
The model was evaluated over 10 deterministic episodes using a monitored environment.
```python
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# `model` is the loaded PPO agent (see Usage)
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="rgb_array"))

mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```
## ▶️ Usage

### 1️⃣ Install Dependencies

```bash
pip install stable-baselines3 "gymnasium[box2d]" huggingface_sb3
```

Note: LunarLander requires the `box2d` extra of Gymnasium.
### 2️⃣ Load the Model

```python
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub
import gymnasium as gym

model_path = load_from_hub(
    repo_id="haitemR/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(model_path)

env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
## 📁 Repository Files

- `ppo-LunarLander-v3.zip` – trained model
- `replay.mp4` – gameplay demonstration
## 📈 Performance

The trained agent achieves a mean reward of 244.59 over 10 evaluation episodes, comfortably above the 200-point threshold at which LunarLander is considered solved, and consistently lands within the target zone.
Training with vectorized environments significantly improved convergence speed.
## 👤 Author
Haitem R.
## 🏷️ Tags

`reinforcement-learning` · `ppo` · `lunar-lander` · `stable-baselines3` · `gymnasium` · `deep-rl`
## 📄 License
MIT License
## 🏆 Evaluation Results

- Mean reward on LunarLander-v3 (self-reported): 244.59