---
library_name: stable-baselines3
tags:
- LunarLander-v2
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      name: mean_reward
      value: 288.92 +/- 21.79
      verified: false
---

# PPO Agent for LunarLander-v2

This is a **PPO agent** trained on the **LunarLander-v2** environment with Stable-Baselines3.

## Developer

**Vishand S (@Vishand03)**

## Frameworks

- Stable-Baselines3
- PyTorch

## Training Details

- Algorithm: PPO
- Timesteps: 2.5M
- Mean Reward: 288.92 +/- 21.79
- Discount factor (γ): 0.99
- Learning rate: 3e-4
- Optimizer: Adam
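
A run with the settings listed above could be set up roughly as follows. This is a minimal sketch, not the author's actual training script: the `MlpPolicy` policy, the single non-vectorized environment, the `train` helper, and the `ppo-LunarLander-v2` save path are all assumptions, and every PPO hyperparameter the card does not list is left at its Stable-Baselines3 default (which already uses Adam). Calling `train()` starts the full 2.5M-step run; pass a copy of the config with a smaller `total_timesteps` for a quick smoke test.

```python
# Hyperparameters from the list above; anything else is an assumption.
TRAIN_CONFIG = {
    "policy": "MlpPolicy",         # assumption: the card does not name the policy
    "learning_rate": 3e-4,         # from the card
    "gamma": 0.99,                 # discount factor, from the card
    "total_timesteps": 2_500_000,  # 2.5M, from the card
}


def train(config=None, save_path="ppo-LunarLander-v2"):
    """Train PPO on LunarLander-v2 and save the model (hypothetical helper)."""
    # Imported lazily so the config can be inspected without the RL stack installed.
    import gymnasium as gym
    from stable_baselines3 import PPO

    config = config or TRAIN_CONFIG
    env = gym.make("LunarLander-v2")

    # Unlisted settings (n_steps, batch_size, ...) stay at SB3 defaults.
    model = PPO(
        config["policy"],
        env,
        learning_rate=config["learning_rate"],
        gamma=config["gamma"],
        verbose=1,
    )
    model.learn(total_timesteps=config["total_timesteps"])
    model.save(save_path)  # SB3 appends ".zip" to the path
    env.close()
    return model
```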

---

## 🎥 Demo (Preview)



---

## 🎬 Full Demo Video

[Watch the full video here](replay.mp4)

---

## Usage

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.evaluation import evaluate_policy
from huggingface_hub import hf_hub_download

# -------------------------
# Environment Setup
# -------------------------
env = gym.make("LunarLander-v2", render_mode="human")  # Rendered env for watching the agent
eval_env = Monitor(gym.make("LunarLander-v2"))         # Evaluation env (no rendering)

# -------------------------
# Load pretrained model
# -------------------------
model_path = hf_hub_download(repo_id="Vishand03/lunarlander-ppo", filename="model.zip")
model = PPO.load(model_path)

# -------------------------
# Run one episode
# -------------------------
obs, _ = env.reset()
done = False
while not done:
    # Use the deterministic (greedy) action from the policy
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()  # Close the render window

# -------------------------
# Evaluate policy
# -------------------------
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"Mean Reward: {mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()
```