Buckets:
| # Two main approaches for solving RL problems [[two-methods]] | |
| Now that we learned the RL framework, how do we solve the RL problem? | |
| In other words, how do we build an RL agent that can **select the actions that maximize its expected cumulative reward?** | |
| ## The Policy π: the agent’s brain [[policy]] | |
| The Policy **π** is the **brain of our Agent**, it’s the function that tells us what **action to take given the state we are in.** So it **defines the agent’s behavior** at a given time. | |
| Think of policy as the brain of our agent, the function that will tell us the action to take given a state | |
| This Policy **is the function we want to learn**, our goal is to find the optimal policy π\*, the policy that **maximizes expected return** when the agent acts according to it. We find this π\* **through training.** | |
| There are two approaches to train our agent to find this optimal policy π\*: | |
| - **Directly,** by teaching the agent to learn which **action to take,** given the current state: **Policy-Based Methods.** | |
| - Indirectly, **teach the agent to learn which state is more valuable** and then take the action that **leads to the more valuable states**: Value-Based Methods. | |
| ## Policy-Based Methods [[policy-based]] | |
| In Policy-Based methods, **we learn a policy function directly.** | |
| This function will define a mapping from each state to the best corresponding action. Alternatively, it could define **a probability distribution over the set of possible actions at that state.** | |
| As we can see here, the policy (deterministic) directly indicates the action to take for each step. | |
| We have two types of policies: | |
| - *Deterministic*: a policy at a given state **will always return the same action.** | |
| action = policy(state) | |
| - *Stochastic*: outputs **a probability distribution over actions.** | |
| policy(actions | state) = probability distribution over the set of actions given the current state | |
| Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state. | |
| If we recap: | |
| ## Value-based methods [[value-based]] | |
| In value-based methods, instead of learning a policy function, we **learn a value function** that maps a state to the expected value **of being at that state.** | |
| The value of a state is the **expected discounted return** the agent can get if it **starts in that state, and then acts according to our policy.** | |
| “Act according to our policy” just means that our policy is **“going to the state with the highest value”.** | |
| Here we see that our value function **defined values for each possible state.** | |
| Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal. | |
| Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal. | |
| If we recap: | |
Xet Storage Details
- Size:
- 3 kB
- Xet hash:
- 687616265809b9181f9fc42d737d5cb89e8b2ce9070cd133e43aaf1e0512163b
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.