---
library_name: stable-baselines3
tags:
  - LunarLander-v2
  - deep-reinforcement-learning
  - reinforcement-learning
  - stable-baselines3
model-index:
  - name: PPO
    results:
      - task:
          type: reinforcement-learning
          name: reinforcement-learning
        dataset:
          name: LunarLander-v2
          type: LunarLander-v2
        metrics:
          - type: mean_reward
            value: 279.70 +/- 18.00
            name: mean_reward
            verified: false
---

# PPO Agent playing LunarLander-v2

This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.

The hyperparameters used for training were optimized with Optuna:

```python
{
    "learning_rate": 0.00038779746460731866,
    "n_steps": 2048,
    "batch_size": 128,
    "n_epochs": 13,
    "gamma": 0.9927390555180292,
    "gae_lambda": 0.9353501463066322,
    "clip_range": clip_range,
    "ent_coef": 0.007068533587811773,
    "policy_kwargs": {
        "net_arch": {"pi": [512, 512], "vf": [512, 512]},
        "activation_fn": nn.Tanh,
    },
}
```

The learning rate was used as the initial value for a linear scheduler during training. See this GitHub issue for more information.
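As a minimal sketch of how such a schedule can be wired up (stable-baselines3 accepts a callable for `learning_rate`; the helper name below is illustrative, not necessarily the one used for this model):

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Map remaining training progress (1.0 -> 0.0) to a learning rate."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining starts at 1.0 and decays to 0.0 over training,
        # so the learning rate decays linearly from initial_value to 0.
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(0.00038779746460731866)
print(lr(1.0))  # initial learning rate
print(lr(0.5))  # halfway through training
```

Passing `learning_rate=linear_schedule(0.00038779746460731866)` to the `PPO(...)` constructor would then apply this decay automatically during training.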

## Usage

```python
import gymnasium as gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env_id = "LunarLander-v2"

# Download the trained checkpoint from the Hugging Face Hub
model_fp = load_from_hub(
    repo_id="reeeemo/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2-optimized.zip",
)

model = PPO.load(model_fp, print_system_info=True)

# Evaluate on a monitored environment
eval_env = Monitor(gym.make(env_id))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"Results: {mean_reward - std_reward:.2f}")  # mean minus one standard deviation
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```