	# Visualize the Clipped Surrogate Objective Function

	Don't worry if this seems complex right now. We're going to look at what the Clipped Surrogate Objective Function looks like, which will help you better visualize what's going on.

	<figure class="image table text-center m-0 w-full">
	<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
	<figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained
	Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
	</figure>

	We have six different situations. Remember first that we take the minimum of the clipped and unclipped objectives.
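The "minimum of the clipped and unclipped objectives" can be sketched for a single sample as below. This is a minimal illustration, not the course's implementation: the function name `clipped_surrogate` and the default `epsilon=0.2` are assumptions chosen for the example.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped surrogate objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Unclipped objective: probability ratio times advantage.
    unclipped = ratio * advantage
    # Clipped objective: the ratio is first forced into [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    clipped = clipped_ratio * advantage
    # PPO keeps the more pessimistic (smaller) of the two.
    return min(unclipped, clipped)


# One ratio inside the range, one below it, one above it (hypothetical values).
for ratio in (1.1, 0.5, 1.5):
    for advantage in (1.0, -1.0):
        print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

The six printed rows correspond to the six situations in the table, although not in the same order.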

	## Cases 1 and 2: the ratio is within the range

	In situations 1 and 2, the clipping does not apply since the ratio is within the range \\( [1 - \epsilon, 1 + \epsilon] \\).

	In situation 1, we have a positive advantage: the action is better than the average of all the actions in that state. Therefore, we should encourage our current policy to increase the probability of taking that action in that state.

	Since the ratio is within the interval, we can increase our policy's probability of taking that action at that state.

	In situation 2, we have a negative advantage: the action is worse than the average of all actions at that state. Therefore, we should discourage our current policy from taking that action in that state.

	Since the ratio is within the interval, we can decrease the probability that our policy takes that action at that state.
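To check that clipping really is inactive in situations 1 and 2, one can estimate the slope of the objective with respect to the ratio using a finite difference. This is a sketch under assumed values (`epsilon=0.2`, ratios `1.1` and `0.9`, advantages `+2` and `-2`); the helper names are hypothetical.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped surrogate objective: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def slope_wrt_ratio(ratio: float, advantage: float, epsilon: float = 0.2, h: float = 1e-6) -> float:
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    upper = clipped_surrogate(ratio + h, advantage, epsilon)
    lower = clipped_surrogate(ratio - h, advantage, epsilon)
    return (upper - lower) / (2.0 * h)


# Situation 1: ratio inside the range, positive advantage.
slope_case_1 = slope_wrt_ratio(1.1, 2.0)
# Situation 2: ratio inside the range, negative advantage.
slope_case_2 = slope_wrt_ratio(0.9, -2.0)
print(slope_case_1, slope_case_2)
```

Inside the range the slope is approximately equal to the advantage itself, so the sign of the advantage decides whether the ratio is pushed up or down.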

	## Cases 3 and 4: the ratio is below the range
	<figure class="image table text-center m-0 w-full">
	<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
	<figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained
	Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
	</figure>

	If the probability ratio is lower than \\( 1 - \epsilon \\), the probability of taking that action at that state is much lower under the current policy than under the old policy.

	If, like in situation 3, the advantage estimate is positive (A>0), then you want to increase the probability of taking that action at that state. The unclipped term is the minimum here, so the gradient is non-zero and the update pushes the ratio back toward the range.

	But if, like in situation 4, the advantage estimate is negative, we don't want to decrease further the probability of taking that action at that state. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights.
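As a worked example (the numbers are assumed purely for illustration), take \\( \epsilon = 0.2 \\) and a ratio \\( r_t(\theta) = 0.5 \\), which is below \\( 1 - \epsilon = 0.8 \\):

- Situation 3, with \\( A_t = +1 \\): \\( \min(0.5 \cdot 1,\ 0.8 \cdot 1) = 0.5 \\). The unclipped term is the minimum, so the objective still depends on \\( r_t(\theta) \\).
- Situation 4, with \\( A_t = -1 \\): \\( \min(0.5 \cdot (-1),\ 0.8 \cdot (-1)) = \min(-0.5,\ -0.8) = -0.8 \\). The clipped term is the minimum, and it stays at \\( -0.8 \\) however far below the range the ratio moves.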

	## Cases 5 and 6: the ratio is above the range
	<figure class="image table text-center m-0 w-full">
	<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
	<figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained
	Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
	</figure>

	If the probability ratio is higher than \\( 1 + \epsilon \\), the probability of taking that action at that state in the current policy is much higher than in the former policy.

	If, like in situation 5, the advantage is positive, we don't want to get too greedy. We already have a higher probability of taking that action at that state than the former policy. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights.

	If, like in situation 6, the advantage is negative, we want to decrease the probability of taking that action at that state.
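The same arithmetic can be repeated above the range. With assumed values \\( \epsilon = 0.2 \\) and \\( r_t(\theta) = 1.5 \\), which is above \\( 1 + \epsilon = 1.2 \\):

- Situation 5, with \\( A_t = +1 \\): \\( \min(1.5 \cdot 1,\ 1.2 \cdot 1) = 1.2 \\). The clipped term is the minimum, so increasing the ratio further brings no gain.
- Situation 6, with \\( A_t = -1 \\): \\( \min(1.5 \cdot (-1),\ 1.2 \cdot (-1)) = \min(-1.5,\ -1.2) = -1.5 \\). The unclipped term is the minimum, so lowering the ratio still improves the objective.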

	To recap, we only update the policy with the unclipped objective part. When the minimum is the clipped objective part, we don't update our policy weights since the gradient will equal 0.

	So we update our policy only if:
	- Our ratio is in the range \\( [1 - \epsilon, 1 + \epsilon] \\)
	- Our ratio is outside the range, but the advantage leads to getting closer to the range:
	  - The ratio is below the range and the advantage is > 0
	  - The ratio is above the range and the advantage is < 0
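The update conditions listed above can be written as a small predicate. This is a sketch only: `policy_is_updated` is a hypothetical helper, `epsilon=0.2` is an assumed default, and the advantage is assumed to be non-zero (with a zero advantage the gradient vanishes regardless of the ratio).

```python
def policy_is_updated(ratio: float, advantage: float, epsilon: float = 0.2) -> bool:
    """Return True when the unclipped term is the minimum, i.e. the gradient is non-zero.

    Assumes advantage != 0.
    """
    if 1.0 - epsilon <= ratio <= 1.0 + epsilon:
        # Situations 1 and 2: inside the range, clipping does not apply.
        return True
    if ratio < 1.0 - epsilon:
        # Situations 3 and 4: below the range, update only if the advantage is positive.
        return advantage > 0
    # Situations 5 and 6: above the range, update only if the advantage is negative.
    return advantage < 0


# (ratio, advantage) pairs for situations 1 to 6, with hypothetical values.
situations = [(1.1, 1.0), (1.1, -1.0), (0.5, 1.0), (0.5, -1.0), (1.5, 1.0), (1.5, -1.0)]
results = [policy_is_updated(r, a) for r, a in situations]
print(results)
```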

	You might wonder why, when the minimum is the clipped objective, the gradient is 0. When the ratio is clipped, the derivative in this case will not be the derivative of \\( r_t(\theta) * A_t \\) but the derivative of either \\( (1 - \epsilon) * A_t \\) or \\( (1 + \epsilon) * A_t \\). Neither expression depends on \\( \theta \\), so both derivatives are 0.
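Written out, and treating \\( A_t \\) as a constant with respect to \\( \theta \\) (the usual assumption when differentiating the policy objective), the per-sample gradient takes one of two forms:

- When the unclipped term is the minimum: \\( \nabla_\theta \left( r_t(\theta) A_t \right) = A_t \nabla_\theta r_t(\theta) \\)
- When the clipped term is the minimum: \\( \nabla_\theta \left( (1 \pm \epsilon) A_t \right) = 0 \\)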


	To summarize, thanks to this clipped surrogate objective, we restrict how far the current policy can move from the old one: the clip removes the incentive for the probability ratio to move outside of the interval by forcing the gradient to zero. If the ratio is > \\( 1 + \epsilon \\) with a positive advantage, or < \\( 1 - \epsilon \\) with a negative advantage, the gradient will be equal to 0.

	The final Clipped Surrogate Objective Loss for PPO Actor-Critic style looks like this. It's a combination of the Clipped Surrogate Objective function, the Value Loss Function, and an Entropy bonus:

	<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-objective.jpg" alt="PPO objective"/>
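In code, combining the three terms might look like the sketch below. It assumes the formulation from the original PPO paper, where the quantity to maximize is the clipped objective minus a weighted value loss plus a weighted entropy bonus. The function name and the coefficient values `c1=0.5` and `c2=0.01` are assumptions for illustration, not values taken from this unit.

```python
def ppo_total_loss(
    clip_objective: float,
    value_loss: float,
    entropy: float,
    c1: float = 0.5,   # assumed weight of the value loss
    c2: float = 0.01,  # assumed weight of the entropy bonus
) -> float:
    """Combine the three PPO terms into one scalar to minimize.

    The objective to maximize is: clip_objective - c1 * value_loss + c2 * entropy.
    Optimizers usually minimize, so the sign is flipped.
    """
    objective = clip_objective - c1 * value_loss + c2 * entropy
    return -objective


# Hypothetical batch averages of the three terms.
loss = ppo_total_loss(clip_objective=1.0, value_loss=2.0, entropy=3.0)
print(loss)
```

Flipping the sign lets a standard gradient-descent optimizer be used: minimizing this loss maximizes the clipped objective and the entropy while minimizing the value error.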

	That was quite complex. Take time to understand these situations by looking at the table and the graph. You must understand why this makes sense. If you want to go deeper, the best resource is the article ["Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick, especially part 3.4](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf).

