| # Visualize the Clipped Surrogate Objective Function | |
| Don't worry. **It's normal if this seems complex to handle right now**. But we're going to see what this Clipped Surrogate Objective Function looks like, and this will help you to visualize better what's going on. | |
| Table from "Towards Delivering a Coherent Self-Contained | |
| Explanation of Proximal Policy Optimization" by Daniel Bick | |
| We have six different situations. Remember first that we take the minimum between the clipped and unclipped objectives. | |
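To make "the minimum between the clipped and unclipped objectives" concrete, here is a minimal per-sample sketch in plain Python. The function name `clipped_surrogate` and the default `epsilon=0.2` are assumptions chosen for illustration; neither comes from this section.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    # Clip the probability ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    clipped = clipped_ratio * advantage
    # Keep the more pessimistic (smaller) of the two terms.
    return min(unclipped, clipped)


# Hypothetical sample: ratio above the range with a positive advantage.
value = clipped_surrogate(ratio=1.5, advantage=1.0)
print(value)
```

With these assumed numbers the clipped term is the smaller one, so the ratio contributes as if it were only \\( 1 + \epsilon \\).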
| ## Cases 1 and 2: the ratio is within the range | |
| In situations 1 and 2, **the clipping does not apply since the ratio is within the range** \\( [1 - \epsilon, 1 + \epsilon] \\) | |
| In situation 1, we have a positive advantage: the **action is better than the average** of all the actions in that state. Therefore, we should encourage our current policy to increase the probability of taking that action in that state. | |
| Since the ratio is within the range, **we can increase our policy's probability of taking that action at that state.** | |
| In situation 2, we have a negative advantage: the action is worse than the average of all actions at that state. Therefore, we should discourage our current policy from taking that action in that state. | |
| Since the ratio is within the range, **we can decrease the probability that our policy takes that action at that state.** | |
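As a quick check of situations 1 and 2, the sketch below verifies that the clipped and unclipped terms coincide whenever the ratio lies inside the range, so the minimum is simply the unclipped term. The helper `surrogate_terms`, the sample ratios, and `EPSILON = 0.2` are illustrative assumptions.

```python
EPSILON = 0.2  # assumed clip range, not specified in this section


def surrogate_terms(ratio, advantage, epsilon=EPSILON):
    """Return the (unclipped, clipped) terms of the objective for one sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return ratio * advantage, clipped_ratio * advantage


in_range_ratios = [0.85, 1.0, 1.15]  # hypothetical ratios inside [0.8, 1.2]
# Situation 1 (A > 0) and situation 2 (A < 0): both terms are identical.
terms_match = all(
    surrogate_terms(r, a)[0] == surrogate_terms(r, a)[1]
    for r in in_range_ratios
    for a in (2.0, -2.0)
)
print(terms_match)
```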
| ## Cases 3 and 4: the ratio is below the range | |
| Table from "Towards Delivering a Coherent Self-Contained | |
| Explanation of Proximal Policy Optimization" by Daniel Bick | |
| If the probability ratio is lower than \\( 1 - \epsilon \\), the probability of taking that action at that state is much lower than with the old policy. | |
| If, like in situation 3, the advantage estimate is positive (A>0), then **you want to increase the probability of taking that action at that state.** | |
| But if, like in situation 4, the advantage estimate is negative, **we don't want to decrease further** the probability of taking that action at that state. The minimum is then the clipped term \\( (1 - \epsilon) A \\), which does not depend on the policy parameters. Therefore, the gradient is 0 (we're on a flat line), so we don't update our weights. | |
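The "flat line" in situation 4 can be checked numerically. This sketch estimates the slope of the objective with respect to the ratio using a central finite difference. The ratio `0.5`, `epsilon=0.2`, and both function names are assumptions for illustration. By the chain rule, a zero slope with respect to the ratio also means a zero gradient with respect to the policy weights.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def slope_wrt_ratio(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio)."""
    up = clipped_surrogate(ratio + h, advantage, epsilon)
    down = clipped_surrogate(ratio - h, advantage, epsilon)
    return (up - down) / (2 * h)


ratio_below = 0.5  # hypothetical ratio below 1 - epsilon = 0.8
slope_case_3 = slope_wrt_ratio(ratio_below, advantage=1.0)   # A > 0
slope_case_4 = slope_wrt_ratio(ratio_below, advantage=-1.0)  # A < 0
print(slope_case_3, slope_case_4)
```

In situation 3 the slope equals the advantage, so the update pushes the ratio back up. In situation 4 the slope is zero.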
| ## Cases 5 and 6: the ratio is above the range | |
| Table from "Towards Delivering a Coherent Self-Contained | |
| Explanation of Proximal Policy Optimization" by Daniel Bick | |
| If the probability ratio is higher than \\( 1 + \epsilon \\), the probability of taking that action at that state in the current policy is **much higher than in the former policy.** | |
| If, like in situation 5, the advantage is positive, **we don't want to get too greedy**. We already have a higher probability of taking that action at that state than the former policy. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights. | |
| If, like in situation 6, the advantage is negative, we want to decrease the probability of taking that action at that state. | |
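The "don't get too greedy" behaviour can be illustrated with a toy loop. As a deliberate simplification, the sketch performs gradient ascent on the ratio itself rather than on policy weights, which is not how PPO is actually trained. The step size, number of steps, `epsilon=0.2`, and function names are all assumptions.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def ascend_ratio(advantage, steps=10, step_size=0.15, epsilon=0.2, h=1e-6):
    """Toy gradient ascent on the ratio, starting from ratio = 1 (old policy)."""
    ratio = 1.0
    for _ in range(steps):
        # Finite-difference slope of the objective with respect to the ratio.
        up = clipped_surrogate(ratio + h, advantage, epsilon)
        down = clipped_surrogate(ratio - h, advantage, epsilon)
        slope = (up - down) / (2 * h)
        ratio += step_size * slope
    return ratio


final_positive = ascend_ratio(advantage=1.0)   # ends in situation 5
final_negative = ascend_ratio(advantage=-1.0)  # ends in situation 4
print(final_positive, final_negative)
```

In both runs the ratio moves for two steps, leaves the range in the direction the advantage favours, and then stays put because the slope has become zero.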
| So if we recap, **we only update the policy with the unclipped objective part**. When the minimum is the clipped objective part, we don't update our policy weights since the gradient will equal 0. | |
| So we update our policy only if: | |
| - Our ratio is in the range \\( [1 - \epsilon, 1 + \epsilon] \\) | |
| - Our ratio is outside the range, but **the advantage leads to getting closer to the range** | |
| - Being below the range but the advantage is > 0 | |
| - Being above the range but the advantage is < 0 | |
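The recap list can be written as a small predicate and checked against all six situations. This is only a sketch: `policy_is_updated`, the sample ratios, and `epsilon=0.2` are hypothetical, and the advantage is assumed to be non-zero.

```python
def policy_is_updated(ratio, advantage, epsilon=0.2):
    """Mirror the recap list for one (ratio, advantage) pair; assumes advantage != 0."""
    if 1.0 - epsilon <= ratio <= 1.0 + epsilon:
        return True           # situations 1 and 2: ratio in the range
    if ratio < 1.0 - epsilon:
        return advantage > 0  # situation 3 updates, situation 4 does not
    return advantage < 0      # situation 6 updates, situation 5 does not


# situation number -> (hypothetical ratio, hypothetical advantage)
situations = {
    1: (1.0, 1.0), 2: (1.0, -1.0),
    3: (0.5, 1.0), 4: (0.5, -1.0),
    5: (1.5, 1.0), 6: (1.5, -1.0),
}
updated = {k: policy_is_updated(r, a) for k, (r, a) in situations.items()}
print(updated)
```

Only situations 4 and 5 come out as `False`, matching the two cases where the gradient is 0.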
| That was quite complex. Take time to understand these situations by looking at the table and the graph. **You must understand why this makes sense.** If you want to go deeper, the best resource is the article ["Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick, especially part 3.4](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf). | |