
	# Two main approaches for solving RL problems [[two-methods]]

	Now that we have learned the RL framework, how do we solve the RL problem?

	In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

	## The Policy π: the agent’s brain [[policy]]

	The policy π is the brain of our agent: it is the function that tells us which action to take given the state we are in. It therefore defines the agent’s behavior at a given time.

	This policy is the function we want to learn. Our goal is to find the optimal policy π\*, the policy that maximizes expected return when the agent acts according to it. We find this π\* through training.

	There are two approaches to train our agent to find this optimal policy π\*:

	- Directly, by teaching the agent to learn which action to take given the current state: Policy-Based Methods.
	- Indirectly, by teaching the agent to learn which states are more valuable and then to take the action that leads to the more valuable states: Value-Based Methods.

	## Policy-Based Methods [[policy-based]]

	In Policy-Based methods, we learn a policy function directly.

	This function will define a mapping from each state to the best corresponding action. Alternatively, it could define a probability distribution over the set of possible actions at that state.

	A deterministic policy directly indicates the action to take at each step.

	We have two types of policies:

	- Deterministic: a policy at a given state will always return the same action.

	action = policy(state)

	- Stochastic: outputs a probability distribution over actions.

	policy(actions \| state) = probability distribution over the set of actions given the current state

	Given a state, our stochastic policy outputs a probability distribution over the possible actions at that state.
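The two policy types can be sketched in a few lines of Python. This is only an illustration: the state names (`s0`, `s1`, `s2`), the action names, and the probabilities are hypothetical and do not come from the course environment. A real policy-based method would learn these mappings through training rather than hard-code them.

```python
import random

# Hypothetical toy state and action names, chosen only for illustration.
ACTIONS = ["left", "right", "jump"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {
    "s0": "right",
    "s1": "right",
    "s2": "jump",
}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.7, "jump": 0.2},
    "s1": {"left": 0.0, "right": 0.9, "jump": 0.1},
    "s2": {"left": 0.2, "right": 0.2, "jump": 0.6},
}


def act_deterministic(state):
    """Return the single action the policy assigns to this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample one action from the policy's distribution for this state."""
    distribution = stochastic_policy[state]
    actions = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `act_deterministic("s0")` always returns the same action, while repeated calls to `act_stochastic("s0")` can return different actions, in proportion to the probabilities assigned to them.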

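	To recap: a deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions.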

	## Value-Based Methods [[value-based]]

	In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.

	The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.
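To make "discounted return" concrete, here is a minimal sketch that computes it for a single sequence of rewards. The discount factor `gamma=0.9` and the reward values are assumptions made for illustration. The value of a state is the expectation of this quantity over the trajectories the policy can produce from that state, which this sketch does not estimate.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the reward at step k by gamma ** k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical episode: a reward of -1 at each of three steps.
episode_rewards = [-1.0, -1.0, -1.0]
g = discounted_return(episode_rewards, gamma=0.9)  # -1 - 0.9 - 0.81
```

With `gamma` below 1, rewards that arrive later count for less than rewards that arrive sooner.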

	Here, “acting according to our policy” simply means “going to the state with the highest value”.

	Here, our value function defines a value for each possible state.

	Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
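The greedy selection described above can be sketched as follows. The corridor layout, the state indices, and the goal value of 0 are hypothetical, chosen so that the visited values start at -7, -6, -5 as in the text. In a real value-based method the values would be learned rather than written by hand.

```python
# Hypothetical corridor: state i has neighbors i - 1 and i + 1; the goal is state 7.
values = {0: -7, 1: -6, 2: -5, 3: -4, 4: -3, 5: -2, 6: -1, 7: 0}
GOAL = 7


def neighbors(state):
    """Return the states reachable in one step from this state."""
    return [s for s in (state - 1, state + 1) if s in values]


def greedy_step(state):
    """Move to the neighboring state with the highest value."""
    return max(neighbors(state), key=lambda s: values[s])


def greedy_path(start):
    """Follow the greedy policy from start until the goal is reached."""
    path = [start]
    while path[-1] != GOAL:
        path.append(greedy_step(path[-1]))
    return path
```

Note that no policy is stored anywhere in this sketch: the behavior falls out of the value function plus the rule "pick the highest-valued next state".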


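	To recap: in policy-based methods we train the policy directly, while in value-based methods we train a value function and the policy follows from it by moving toward the states with the highest value.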
