articblue committed on
Commit
eaf3a9a
·
1 Parent(s): f259fa6

RLLander-v2.py upload. It's a Jupyter Colab notebook export, so it's just there for reference on the base operations.

Browse files
Files changed (1)
  1. RLlander-v2.py +633 -0
RLlander-v2.py ADDED
# -*- coding: utf-8 -*-
"""Copy of unit1.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/drive/1_RtxD6AEBoDSooM2wtpSWv8CBQf8FH45

# Unit 1: Train your first Deep Reinforcement Learning Agent 🤖

![Cover](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/thumbnail.jpg)

In this notebook, you'll train your **first Deep Reinforcement Learning agent**: a Lunar Lander agent that will learn to **land correctly on the Moon 🌕**, using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/), a Deep Reinforcement Learning library. You'll also share your agent with the community and experiment with different configurations.

⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
"""

# Commented out IPython magic to ensure Python compatibility.
# %%html
# <video controls autoplay><source src="https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>

"""### The environment 🎮

- [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

### The library used 📚

- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)

We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub repo](https://github.com/huggingface/deep-rl-class/issues).

## Objectives of this notebook 🏆

At the end of the notebook, you will:

- Be able to use **Gymnasium**, the environment library.
- Be able to use **Stable-Baselines3**, the deep reinforcement learning library.
- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score 🔥.

## This notebook is from the Deep Reinforcement Learning Course

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>

In this free course, you will:

- 📖 Study Deep Reinforcement Learning in **theory and practice**.
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
- 🤖 Train **agents in unique environments**.
- 🎓 **Earn a certificate of completion** by completing 80% of the assignments.

And more!

Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course

Don't forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we collect your email so we can **send you the links when each Unit is published, and give you information about the challenges and updates**).

The best way to keep in touch and ask questions is **to join our Discord server** to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5

## Prerequisites 🏗️

Before diving into the notebook, you need to:

🔲 📝 **[Read Unit 0](https://huggingface.co/deep-rl-course/unit0/introduction)**, which gives you all the **information about the course and helps you onboard** 🤗

🔲 📚 **Develop an understanding of the foundations of Reinforcement Learning** (MC, TD, the reward hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction).

## A small recap of Deep Reinforcement Learning 📚

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">

Let's do a small recap of what we learned in the first Unit:

- Reinforcement Learning is a **computational approach to learning from actions**. We build an agent that learns from the environment by **interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.

- The goal of any RL agent is to **maximize its expected cumulative reward** (also called expected return), because RL is based on the _reward hypothesis_: all goals can be described as the maximization of an expected cumulative reward.

- The RL process is a **loop that outputs a sequence of state, action, reward, and next state**.

- To calculate the expected cumulative reward (expected return), **we discount the rewards**: rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than long-term future rewards.

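Concretely, if `gamma` is the discount rate (a value between 0 and 1), a reward arriving `t` steps in the future is weighted by `gamma**t`. A minimal sketch in Python, where `rewards` is a hypothetical list of per-step rewards:

```
gamma = 0.99  # discount rate
rewards = [1.0, 1.0, 1.0]  # hypothetical per-step rewards
# Discounted return: sum over t of gamma^t * r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
```
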
- To solve an RL problem, you want to **find an optimal policy**. The policy is the "brain" of your agent: it tells you what action to take given a state. The optimal policy is the one whose actions maximize the expected return.

There are **two** ways to find your optimal policy:

- By **training your policy directly**: policy-based methods.
- By **training a value function** that tells us the expected return the agent will get at each state, and using this function to define our policy: value-based methods.

- Finally, we spoke about Deep RL because **we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based), hence the name "deep".**

# Let's train our first Deep Reinforcement Learning agent and upload it to the Hub 🚀

## Get a certificate 🎓

To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.

To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model: **the result = mean_reward - std of the reward**.

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process

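In code terms, once you have `mean_reward` and `std_reward` from an evaluation (we compute both later with `evaluate_policy`), the leaderboard result is simply:

```
result = mean_reward - std_reward  # needs to be >= 200 to validate this hands-on
```
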
## Set the GPU 💪

- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">

- `Hardware Accelerator > GPU`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">

## Install dependencies and create a virtual screen 🔽

The first step is to install the dependencies. We'll install several:

- `gymnasium[box2d]`: Contains the LunarLander-v2 environment 🌛
- `stable-baselines3[extra]`: The deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-Baselines3 to load and upload models from the Hugging Face 🤗 Hub.

To make things easier, we created a script that installs all these dependencies.
"""

!apt install swig cmake

!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

"""During the notebook, we'll need to generate a replay video. To do so, in Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).

Hence, the following cell will install the virtual screen libraries and create and run a virtual screen 🖥
"""

!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

"""To make sure the newly installed libraries are used, **it's sometimes required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"""

import os
os.kill(os.getpid(), 9)

# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

"""## Import the packages 📦

One additional library we import is `huggingface_hub`, **to be able to upload and download trained models from the Hub**.

The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations, and other features that allow you to easily collaborate with others.

You can see all the Deep Reinforcement Learning models available on the Hub here 👉 https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads
"""

import gymnasium

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login # To log in to our Hugging Face account to be able to upload models to the Hub.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

"""## Understand Gymnasium and how it works 🤖

🏋 The library containing our environment is called Gymnasium.
**You'll use Gymnasium a lot in Deep Reinforcement Learning.**

Gymnasium is the **new version of the Gym library**, [maintained by the Farama Foundation](https://farama.org/).

The Gymnasium library provides two things:

- An interface that allows you to **create RL environments**.
- A **collection of environments** (gym-control, atari, box2D...).

Let's look at an example, but first let's recall the RL loop.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">

At each step:
- Our Agent receives a **state (S0)** from the **Environment** — we receive the first frame of our game (Environment).
- Based on that **state (S0),** the Agent takes an **action (A0)** — our Agent will move to the right.
- The environment transitions to a **new** **state (S1)** — new frame.
- The environment gives some **reward (R1)** to the Agent — we're not dead *(Positive Reward +1)*.

With Gymnasium:

1️⃣ We create our environment using `gymnasium.make()`

2️⃣ We reset the environment to its initial state with `observation, info = env.reset()`

At each step:

3️⃣ Get an action using our model (in our example we take a random action)

4️⃣ Using `env.step(action)`, we perform this action in the environment and get
- `observation`: The new state (st+1)
- `reward`: The reward we get after executing the action
- `terminated`: Indicates if the episode terminated (the agent reached the terminal state)
- `truncated`: Introduced in this new version, it indicates a time limit or, for instance, that the agent went out of bounds of the environment.
- `info`: A dictionary that provides additional information (depends on the environment).

For more explanations, check this 👉 https://gymnasium.farama.org/api/env/#gymnasium.Env.step

If the episode is terminated:
- We reset the environment to its initial state with `observation, info = env.reset()`

**Let's look at an example!** Make sure to read the code.
"""

import gymnasium as gym

# First, we create our environment called LunarLander-v2
env = gym.make("LunarLander-v2")

# Then we reset this environment
observation, info = env.reset()

for _ in range(20):
    # Take a random action
    action = env.action_space.sample()
    print("Action taken:", action)

    # Do this action in the environment and get
    # next_state, reward, terminated, truncated and info
    observation, reward, terminated, truncated, info = env.step(action)

    # If the game is terminated (in our case we land or crash) or truncated (timeout)
    if terminated or truncated:
        # Reset the environment
        print("Environment is reset")
        observation, info = env.reset()

env.close()

"""## Create the LunarLander environment 🌛 and understand how it works

### [The environment 🎮](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

In this first tutorial, we're going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**

---

💡 A good habit when you start to use an environment is to check its documentation

👉 https://gymnasium.farama.org/environments/box2d/lunar_lander/

---

Let's see what the Environment looks like:
"""

# We create our environment with gym.make("<name_of_the_environment>")
env = gym.make("LunarLander-v2")
env.reset()
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample()) # Get a random observation

"""We see with `Observation Space Shape (8,)` that the observation is a vector of size 8, where each value contains different information about the lander:
- Horizontal pad coordinate (x)
- Vertical pad coordinate (y)
- Horizontal speed (x)
- Vertical speed (y)
- Angle
- Angular speed
- Whether the left leg contact point has touched the land (boolean)
- Whether the right leg contact point has touched the land (boolean)
"""

print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action

"""The action space (the set of possible actions the agent can take) is discrete, with 4 actions available 🎮:

- Action 0: Do nothing,
- Action 1: Fire left orientation engine,
- Action 2: Fire the main engine,
- Action 3: Fire right orientation engine.

Reward function (the function that gives a reward at each timestep) 💰:

After every step, a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.

For each step, the reward:

- Is increased/decreased the closer/further the lander is to the landing pad.
- Is increased/decreased the slower/faster the lander is moving.
- Is decreased the more the lander is tilted (angle not horizontal).
- Is increased by 10 points for each leg that is in contact with the ground.
- Is decreased by 0.03 points each frame a side engine is firing.
- Is decreased by 0.3 points each frame the main engine is firing.

The episode receives an **additional reward of -100 or +100 points for crashing or landing safely, respectively.**

An episode is **considered a solution if it scores at least 200 points.**

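As a rough sketch of how these per-step rewards add up to an episode score (a minimal example reusing the `env` created above, with random actions just for illustration):

```
episode_reward = 0.0
observation, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random actions, so expect a poor score
    observation, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward  # shaping terms, engine penalties, and the final -100/+100
print("Episode return:", episode_reward)  # >= 200 counts as solved
```
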
#### Vectorized Environment

- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments; this way, **we'll have more diverse experiences during training.**
"""

# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)

"""## Create the Model 🤖
- We have studied our environment and we understand the problem: **land the Lunar Lander on the Landing Pad correctly by controlling the left, right, and main engines**. Now let's build the algorithm we're going to use to solve this problem 🚀.

- To do so, we're going to use our first Deep RL library, [Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/).

- SB3 is a set of **reliable implementations of reinforcement learning algorithms in PyTorch**.

---

💡 A good habit when using a new library is to dive into the documentation first (https://stable-baselines3.readthedocs.io/en/master/) and then try some tutorials.

----

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sb3.png" alt="Stable Baselines3">

To solve this problem, we're going to use SB3's **PPO**. [PPO (Proximal Policy Optimization) is one of the SOTA (state of the art) Deep Reinforcement Learning algorithms that you'll study during this course](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example).

PPO is a combination of:
- *Value-based reinforcement learning*: learning an action-value function that tells us **how valuable it is to take a given action in a given state**.
- *Policy-based reinforcement learning*: learning a policy that **gives us a probability distribution over actions**.

Stable-Baselines3 is easy to set up:

1️⃣ You **create your environment** (in our case it was done above)

2️⃣ You define the **model you want to use and instantiate it**: `model = PPO("MlpPolicy")`

3️⃣ You **train the agent** with `model.learn` and define the number of training timesteps

```
# Create environment
env = gym.make('LunarLander-v2')

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)
# Train the agent
model.learn(total_timesteps=int(2e5))
```
"""

# TODO: Define a PPO MlpPolicy architecture
# We use a MultiLayerPerceptron (MlpPolicy) because the input is a vector;
# if we had frames as input we would use CnnPolicy
model = PPO(
    'MlpPolicy',
    env=env,
    verbose=1,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
)

"""#### Solution"""

# SOLUTION
# We added some parameters to accelerate the training
model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)

"""## Train the PPO agent 🏃
- Let's train our agent for 1,000,000 timesteps; don't forget to use the GPU on Colab. It will take approximately ~20 min, but you can use fewer timesteps if you just want to try it out.
- During the training, take a ☕ break, you deserved it 🤗
"""

# TODO: Train it for 1,000,000 timesteps
model.learn(total_timesteps=int(1e6))
# TODO: Specify a file name for the model and save the model to file
model_name = "ppo-LunarLander-v2"
model.save(model_name)

"""#### Solution"""

# SOLUTION
# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1000000)
# Save the model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

"""## Evaluate the agent 📈
- Remember to wrap the environment in a [Monitor](https://stable-baselines3.readthedocs.io/en/master/common/monitor.html).
- Now that our Lunar Lander agent is trained 🚀, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`.
- To fill in that part you need to [check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading)
- In the next step, we'll see **how to automatically evaluate and share your agent to compete in a leaderboard, but for now let's do it ourselves**

💡 When you evaluate your agent, you should not use your training environment, but create an evaluation environment.
"""

# TODO: Evaluate the agent
del model # to illustrate loading
# Create an evaluation environment wrapped in a Monitor
eval_env = Monitor(gym.make("LunarLander-v2"))
# Load the trained agent
model = PPO.load("ppo-LunarLander-v2")

# Evaluate the model with 10 evaluation episodes and deterministic=True
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

# Print the results
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

"""#### Solution"""

#@title
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

"""- In my case, I got a mean reward of `200.20 +/- 20.80` after training for 1 million steps, which means that our lunar lander agent is ready to land on the moon 🌛🥳.

## Publish our trained model on the Hub 🔥
Now that we've seen we got good results after the training, we can publish our trained model on the Hub 🤗 with one line of code.

📚 The library's documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20

Here's an example of a Model Card (with Space Invaders):

By using `package_to_hub`, **you evaluate, record a replay, generate a model card for your agent, and push it to the Hub**.

This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share an agent with the community that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

To be able to share your model with the community, there are three more steps to follow:

1️⃣ (If it's not already done) create an account on Hugging Face ➡ https://huggingface.co/join

2️⃣ Sign in, then get your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with the write role**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">

- Copy the token
- Run the cell below and paste the token
"""

notebook_login()
!git config --global credential.helper store

"""If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`

3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function.

Let's fill in the `package_to_hub` function:
- `model`: our trained model.
- `model_name`: the name of the trained model that we defined in `model_save`
- `model_architecture`: the model architecture we used, in our case PPO
- `env_id`: the name of the environment, in our case `LunarLander-v2`
- `eval_env`: the evaluation environment defined in `eval_env`
- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated `(repo_id = {username}/{repo_name})`

💡 **A good name is {username}/{model_architecture}-{env_id}**

- `commit_message`: message of the commit
"""

import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub

## TODO: Define a repo_id
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = "articblue/ppo-LunarLander-v2"

# TODO: Define the name of the environment
env_id = "LunarLander-v2"

# Create the evaluation env and set render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: Monitor(gym.make(env_id, render_mode="rgb_array"))])

# TODO: Define the model architecture we used
model_architecture = "PPO"

## TODO: Define the commit message
commit_message = "Lunar Lander v2 simple exercise"

# This method saves, evaluates, generates a model card, and records a replay video of your agent before pushing the repo to the Hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub
               commit_message=commit_message)

"""#### Solution"""

import gymnasium as gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub

# PLACE the variables you've just defined two cells above
# Define the name of the environment
env_id = "LunarLander-v2"

# Define the model architecture we used
model_architecture = "PPO"

## Define a repo_id
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
## CHANGE WITH YOUR REPO ID
repo_id = "ThomasSimonini/ppo-LunarLander-v2" # Change with your repo id, you can't push with mine 😄

## Define the commit message
commit_message = "Upload PPO LunarLander-v2 trained agent"

# Create the evaluation env and set render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

# PLACE the package_to_hub function you've just filled here
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub
               commit_message=commit_message)

"""Congrats 🥳 you've just trained and uploaded your first Deep Reinforcement Learning agent. The script above should have displayed a link to a model repository such as https://huggingface.co/osanseviero/test_sb3. When you go to this link, you can:
* See a video preview of your agent on the right.
* Click "Files and versions" to see all the files in the repository.
* Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
* See a model card (the `README.md` file), which gives a description of the model.

Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.

Compare the results of your LunarLander-v2 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

## Load a saved LunarLander model from the Hub 🤗
Thanks to [ironbar](https://github.com/ironbar) for the contribution.

Loading a saved model from the Hub is really easy.

You go to https://huggingface.co/models?library=stable-baselines3 to see the list of all the Stable-Baselines3 saved models.
1. You select one and copy its repo_id

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit1/copy-id.png" alt="Copy-id"/>

2. Then we just need to use `load_from_hub` with:
- The repo_id
- The filename: the saved model inside the repo and its extension (*.zip)

Because the model I downloaded from the Hub was trained with Gym (the former version of Gymnasium), we need to install Shimmy, an API conversion tool that will help us run the environment correctly.

Shimmy documentation: https://github.com/Farama-Foundation/Shimmy
"""

!pip install shimmy

from huggingface_sb3 import load_from_hub
repo_id = "Classroom-workshop/assignment2-omar" # The repo_id
filename = "ppo-LunarLander-v2.zip" # The model filename.zip

# When the model was trained on Python 3.8, the pickle protocol is 5,
# but Python 3.6 and 3.7 use protocol 4.
# In order to get compatibility we need to:
# 1. Install pickle5 (we did it at the beginning of the colab)
# 2. Create custom objects that we pass as a parameter to PPO.load()
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}

checkpoint = load_from_hub(repo_id, filename)
model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)

"""Let's evaluate this agent:"""

#@title
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

"""## Some additional challenges 🏆
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!

In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?

Here are some ideas for getting there:
* Train for more steps
* Try different hyperparameters for `PPO`. You can see them at https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters.
* Check the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) and try another model such as DQN (see the sketch below).
* **Push your newly trained model** to the Hub 🔥

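For instance, a minimal sketch of swapping in DQN (hyperparameters left at their defaults here, which are illustrative rather than tuned; check the DQN docs for the real knobs):

```
from stable_baselines3 import DQN

# DQN works with a single environment; we create a fresh one here
dqn_env = gym.make("LunarLander-v2")
dqn_model = DQN("MlpPolicy", dqn_env, verbose=1)
dqn_model.learn(total_timesteps=int(1e6))
dqn_model.save("dqn-LunarLander-v2")
```
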
**Compare the results of your LunarLander-v2 with your classmates** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) 🏆

Is moon landing too boring for you? Try to **change the environment**: why not use MountainCar-v0, CartPole-v1 or CarRacing-v0? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.

________________________________________________________________________
Congrats on finishing this chapter! That was the biggest one, **and there was a lot of information.**

If you still feel confused by all these elements... it's totally normal! **This was the same for me and for everyone who has studied RL.**

Take time to really **grasp the material before continuing, and try the additional challenges**. It's important to master these elements and have a solid foundation.

Naturally, during the course we're going to dive deeper into these concepts, but **it's better to have a good understanding of them now, before diving into the next chapters.**

Next time, in bonus unit 1, you'll train Huggy the Dog to fetch the stick.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit1/huggy.jpg" alt="Huggy"/>

## Keep learning, stay awesome 🤗
"""