| # The Reinforcement Learning Framework [[the-reinforcement-learning-framework]] | |
| ## The RL Process [[the-rl-process]] | |
| The RL Process: a loop of state, action, reward and next state | |
| Source: Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto | |
| To understand the RL process, let’s imagine an agent learning to play a platform game: | |
| - Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment). | |
| - Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right. | |
| - The environment goes to a **new** **state \\(S_1\\)** — new frame. | |
| - The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*. | |
| This RL loop outputs a sequence of **state, action, reward and next state.** | |
| The agent's goal is to _maximize_ its expected cumulative reward, **called the expected return.** | |
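The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyPlatformEnv`, its `reset`/`step` methods, and the `always_right` policy are hypothetical names invented for this example, not part of any library or of the course code.

```python
class ToyPlatformEnv:
    """Hypothetical 1-D platform level: start at position 0, goal at position 5."""

    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        # Return the initial state S_0.
        self.position = 0
        return self.position

    def step(self, action):
        # Assumed encoding: action 1 moves right, anything else moves left.
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        reward = 1  # +1 for each step we survive (assumption for this toy level)
        done = self.position >= self.goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and record (state, action, reward, next_state) tuples."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent picks A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    # Trivial policy: always move right, whatever the state.
    return 1


trajectory = run_episode(ToyPlatformEnv(), always_right)
cumulative_reward = sum(reward for _, _, reward, _ in trajectory)
print(trajectory[0], cumulative_reward)
```

Each tuple in `trajectory` is one pass through the loop, and summing the rewards gives the cumulative reward the agent tries to maximize.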
| ## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]] | |
| ⇒ Why is the goal of the agent to maximize the expected return? | |
| Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward). | |
| That’s why in Reinforcement Learning, **to have the best behavior,** we aim to learn to take actions that **maximize the expected cumulative reward.** | |
| ## Markov Property [[markov-property]] | |
| In papers, you’ll see that the RL process is called a **Markov Decision Process** (MDP). | |
| We’ll talk again about the Markov Property in the following units. But if you need to remember one thing about it today, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** it took before. | |
| ## Observations/States Space [[obs-space]] | |
| Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc. | |
| There is, however, a distinction to make between *observation* and *state*: | |
| - *State s*: **a complete description of the state of the world** (there is no hidden information). We receive a state in a fully observed environment. | |
| In a chess game, we have access to the whole board, so we receive a state from the environment. In other words, the environment is fully observed. | |
| - *Observation o*: a **partial description of the state.** We receive an observation in a partially observed environment. | |
| In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation. In other words, the environment is partially observed. | |
| In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations. | |
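| To recap: a *state* is a complete description of the world (fully observed environment), while an *observation* is only a partial description of it (partially observed environment). | |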
| ## Action Space [[action-space]] | |
| The Action space is the set of **all possible actions in an environment.** | |
| The actions can come from a *discrete* or *continuous space*: | |
| - *Discrete space*: the number of possible actions is **finite**. | |
| In Super Mario Bros, we have a finite set of actions since there are only 4 possible actions: left, right, up (jumping) and down (crouching). | |
| - *Continuous space*: the number of possible actions is **infinite**. | |
| A self-driving car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°… | |
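| To recap: a *discrete* action space has a finite number of possible actions, while a *continuous* action space has an infinite number of them. | |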
| Keeping this distinction in mind is crucial because it will **matter when choosing an RL algorithm later.** | |
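As a rough illustration of the difference, the sketch below samples one action from each kind of space using only the Python standard library. The action names come from the Super Mario Bros example above. The steering bounds of ±30° and the helper names are assumptions made up for this example.

```python
import random

# Discrete space: a finite set of actions (the 4 actions from the Mario example).
DISCRETE_ACTIONS = ["left", "right", "up", "down"]


def sample_discrete(rng):
    """Pick one of the finitely many actions."""
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng, low=-30.0, high=30.0):
    """Pick a steering angle in degrees; the bounds are illustrative assumptions."""
    return rng.uniform(low, high)


rng = random.Random(0)  # seeded so repeated runs give the same samples
discrete_action = sample_discrete(rng)
continuous_action = sample_continuous(rng)
print(discrete_action, continuous_action)
```

A discrete action can be stored as an index into a list, whereas a continuous action is a real number (or a vector of them), which is one reason the two cases tend to call for different algorithms.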
| ## Rewards and the discounting [[rewards]] | |
| The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.** | |
| The cumulative reward at each time step **t** is the sum of all rewards in the sequence: | |
| The cumulative reward = \\(r_{t+1} + r_{t+2} + r_{t+3} + \dots\\) | |
| Which is equivalent to: | |
| The cumulative reward = \\(\sum_{k=0}^{\infty} r_{t+k+1}\\), where \\(k = 0\\) gives \\(r_{t+1}\\), \\(k = 1\\) gives \\(r_{t+2}\\), and so on. | |
| However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward. | |
| Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat.** | |
| As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is). | |
| Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it. | |
| To discount the rewards, we proceed like this: | |
| 1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time, it is between **0.95 and 0.99**. | |
| - The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.** | |
| - On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short term reward (the nearest cheese).** | |
| 2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.** | |
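| Our discounted expected cumulative reward is: \\(\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots\\) | |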
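The discounting procedure above, where each reward is multiplied by gamma raised to the number of steps before it arrives, can be sketched as follows. The function name `discounted_return` and the sample reward list are hypothetical and chosen only to illustrate the arithmetic.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum the rewards, weighting the k-th future reward by gamma ** k."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Three small pieces of cheese nearby, one big piece near the cat (made-up values).
rewards = [1, 1, 1, 10]

undiscounted = discounted_return(rewards, gamma=1.0)  # 1 + 1 + 1 + 10 = 13.0
short_sighted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 + 1.25 = 3.0
print(undiscounted, short_sighted)
```

With `gamma=1.0` the large, distant reward dominates the total, while with `gamma=0.5` it contributes barely more than the first small reward. This is the sense in which a smaller gamma makes the agent care more about the short term.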