Buckets:

hf-doc-build
/

doc-dev

Files

xet

hf-doc-build/doc-dev / deep-rl-course /pr_676 /en /unit1 /rl-framework.md

rtrm

about 2 months ago

preview code

download

raw

5.98 kB

	# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]

	## The RL Process [[the-rl-process]]

	The RL Process: a loop of state, action, reward and next state
	Source: Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto

	To understand the RL process, let’s imagine an agent learning to play a platform game:

	- Our Agent receives state \\(S_0\\) from the Environment — we receive the first frame of our game (Environment).
	- Based on that state \\(S_0\\), the Agent takes action \\(A_0\\) — our Agent will move to the right.
	- The environment goes to a new state \\(S_1\\) — new frame.
	- The environment gives some reward \\(R_1\\) to the Agent — we’re not dead (Positive Reward +1).

	This RL loop outputs a sequence of state, action, reward and next state.

	The agent's goal is to _maximize_ its cumulative reward, called the expected return.

	## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]

	⇒ Why is the goal of the agent to maximize the expected return?

	Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).

	That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.

	## Markov Property [[markov-property]]

	In papers, you’ll see that the RL process is called a Markov Decision Process (MDP).

	We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs only the current state to decide what action to take and not the history of all the states and actions they took before.

	## Observations/States Space [[obs-space]]

	Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.

	There is a differentiation to make between observation and state, however:

	- State s: is a complete description of the state of the world (there is no hidden information). In a fully observed environment.

	In chess game, we receive a state from the environment since we have access to the whole check board information.

	In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.

	- Observation o: is a partial description of the state. In a partially observed environment.

	In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.

	In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.

	In Super Mario Bros, we are in a partially observed environment. We receive an observation since we only see a part of the level.

	In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.

	To recap:

	## Action Space [[action-space]]

	The Action space is the set of all possible actions in an environment.

	The actions can come from a discrete or continuous space:

	- Discrete space: the number of possible actions is finite.

	In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).

	Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.

	- Continuous space: the number of possible actions is infinite.

	A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°…

	To recap:

	Taking this information into consideration is crucial because it will have importance when choosing the RL algorithm in the future.

	## Rewards and the discounting [[rewards]]

	The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

	The cumulative reward at each time step t can be written as:

	The cumulative reward equals the sum of all rewards in the sequence.

	Which is equivalent to:

	The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...

	However, in reality, we can’t just add them like that. The rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.

	Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is to eat the maximum amount of cheese before being eaten by the cat.

	As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

	Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.

	To discount the rewards, we proceed like this:

	1. We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99.
	- The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
	- On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

	2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

	Our discounted expected cumulative reward is:

Xet Storage Details

Size:: 5.98 kB
Xet hash:: 37aaa8c98ebcd65b507bb3703ca9c246ee1875c9597334f12b34c163221beb9c

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.

	# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]

	## The RL Process [[the-rl-process]]

	The RL Process: a loop of state, action, reward and next state
	Source: Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto

	To understand the RL process, let’s imagine an agent learning to play a platform game:

	- Our Agent receives state \\(S_0\\) from the Environment — we receive the first frame of our game (Environment).
	- Based on that state \\(S_0\\), the Agent takes action \\(A_0\\) — our Agent will move to the right.
	- The environment goes to a new state \\(S_1\\) — new frame.
	- The environment gives some reward \\(R_1\\) to the Agent — we’re not dead (Positive Reward +1).

	This RL loop outputs a sequence of state, action, reward and next state.

	The agent's goal is to _maximize_ its cumulative reward, called the expected return.

	## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]

	⇒ Why is the goal of the agent to maximize the expected return?

	Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).

	That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.

	## Markov Property [[markov-property]]

	In papers, you’ll see that the RL process is called a Markov Decision Process (MDP).

	We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs only the current state to decide what action to take and not the history of all the states and actions they took before.

	## Observations/States Space [[obs-space]]

	Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.

	There is a differentiation to make between observation and state, however:

	- State s: is a complete description of the state of the world (there is no hidden information). In a fully observed environment.

	In chess game, we receive a state from the environment since we have access to the whole check board information.

	In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.

	- Observation o: is a partial description of the state. In a partially observed environment.

	In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.

	In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.

	In Super Mario Bros, we are in a partially observed environment. We receive an observation since we only see a part of the level.

	In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.

	To recap:

	## Action Space [[action-space]]

	The Action space is the set of all possible actions in an environment.

	The actions can come from a discrete or continuous space:

	- Discrete space: the number of possible actions is finite.

	In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).

	Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.

	- Continuous space: the number of possible actions is infinite.

	A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°…

	To recap:

	Taking this information into consideration is crucial because it will have importance when choosing the RL algorithm in the future.

	## Rewards and the discounting [[rewards]]

	The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

	The cumulative reward at each time step t can be written as:

	The cumulative reward equals the sum of all rewards in the sequence.

	Which is equivalent to:

	The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...

	However, in reality, we can’t just add them like that. The rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.

	Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is to eat the maximum amount of cheese before being eaten by the cat.

	As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

	Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.

	To discount the rewards, we proceed like this:

	1. We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99.
	- The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
	- On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

	2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

	Our discounted expected cumulative reward is: