	# A Q-Learning example [[q-learning-example]]

	To better understand Q-Learning, let's take a simple example:

	- You're a mouse in this tiny maze. You always start at the same starting point.
	- The goal is to eat the big pile of cheese at the bottom right-hand corner and avoid the poison. After all, who doesn't like cheese?
	- The episode ends if we eat the poison, eat the big pile of cheese, or take more than five steps.
	- The learning rate is 0.1.
	- The discount rate (gamma) is 0.99.

	The reward function goes like this:

	- +0: Going to a state with no cheese in it.
	- +1: Going to a state with a small cheese in it.
	- +10: Going to the state with the big pile of cheese.
	- -10: Going to the state with the poison and thus dying.
	- +0: If we take more than five steps.
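The reward function and the episode-ending rules above can be sketched as a small lookup. The cell labels (`"empty"`, `"small_cheese"`, and so on) and the helper names are hypothetical, chosen only for illustration; the reward values and the five-step limit come from the lists above.

```python
# Hypothetical cell labels; the reward values are the ones listed above.
REWARDS = {
    "empty": 0,
    "small_cheese": 1,
    "big_cheese": 10,
    "poison": -10,
}
MAX_STEPS = 5  # the episode is cut off once we take more than five steps


def reward_for(cell: str) -> int:
    """Reward received for moving into a cell of the given kind."""
    return REWARDS[cell]


def is_terminal(cell: str, steps_taken: int) -> bool:
    """The episode ends on poison, on the big cheese, or past the step limit."""
    return cell in ("poison", "big_cheese") or steps_taken > MAX_STEPS


print(reward_for("small_cheese"), is_terminal("poison", 2))
```

Running out of steps adds no extra reward, which matches the last +0 entry in the list.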

	To train our agent to have an optimal policy (so a policy that goes right, right, down), we will use the Q-Learning algorithm.

	## Step 1: Initialize the Q-table [[step1]]

	We initialize every state-action value in the Q-table, typically to 0. So, for now, our Q-table is useless; we need to train our Q-function using the Q-Learning algorithm.
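As a minimal sketch of this initialization step, the table can be a list of rows, one per state, with one value per action. The sizes here are assumptions: six states (a 2 x 3 maze, one state per cell) and four actions. The all-zero starting value is a common choice and is also assumed, not taken from the text.

```python
N_STATES = 6  # assumption: a 2 x 3 maze, one state per cell
ACTIONS = ["left", "right", "up", "down"]  # assumed action set


def init_q_table(n_states: int, n_actions: int) -> list:
    """Build a Q-table with every state-action value set to 0.0."""
    # A fresh inner list per state, so rows do not share storage.
    return [[0.0] * n_actions for _ in range(n_states)]


q_table = init_q_table(N_STATES, len(ACTIONS))
for state, row in enumerate(q_table):
    print(state, row)
```

With every entry equal, the table gives no reason to prefer one action over another, which is why it is useless before training.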

	Let's do it for 2 training timesteps:

	Training timestep 1:

	## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2]]

	Because epsilon is big (= 1.0), I take a random action. In this case, I go right.
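An epsilon-greedy choice can be sketched as follows. The function name and signature are hypothetical; the idea is only that with probability epsilon we pick a random action (exploration), and otherwise we pick the action with the highest current Q-value (exploitation).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen with the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: the action with the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 1.0 the draw is always below epsilon, so we always explore.
action = epsilon_greedy([0.0, 0.0, 0.0, 0.0], epsilon=1.0)
print(action)
```

Because `random.random()` returns a value in [0, 1), an epsilon of 1.0 makes every choice random, which is the situation in this first timestep.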

	## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]

	By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.

	## Step 4: Update Q(St, At) [[step4]]

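	We can now update \\(Q(S_t, A_t)\\) using the Q-Learning update rule: \\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]\\).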
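To make this update concrete, here is a small sketch of the standard Q-Learning rule (new estimate = old estimate + learning rate x TD error) applied to this first timestep. The function name is hypothetical, and the example assumes the Q-table started at 0, so both the old value and the best next-state value are 0.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.99):
    """One Q-Learning update for a single state-action pair."""
    td_target = reward + gamma * max_next_q  # what we now think Q should be
    td_error = td_target - q_sa              # how far off the old estimate was
    return q_sa + alpha * td_error           # move a fraction alpha toward it


# Timestep 1: went right, got the small cheese (reward 1),
# assuming all Q-values were initialized to 0.
new_q = q_update(q_sa=0.0, reward=1, max_next_q=0.0)
print(new_q)  # 0.1
```

Under those assumptions the value of going right from the start moves from 0 to 0.1: the agent has learned a little from a single step.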

	Training timestep 2:

	## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2-2]]

	I take a random action again, since epsilon = 0.99 is still big. (Notice that we decay epsilon a little because, as training progresses, we want less and less exploration.)
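The text does not say which decay schedule is used. One simple possibility, shown here purely as an assumption, is to multiply epsilon by a constant factor after each step and never let it fall below a floor. The names `decay_rate` and `min_epsilon` and their values are hypothetical.

```python
def decay_epsilon(epsilon, decay_rate=0.99, min_epsilon=0.05):
    """Shrink epsilon multiplicatively, but keep a minimum of exploration."""
    return max(min_epsilon, epsilon * decay_rate)


epsilon = 1.0
epsilon = decay_epsilon(epsilon)
print(epsilon)  # 0.99
```

With this assumed schedule, one decay step takes epsilon from 1.0 to 0.99, matching the value used in this timestep. Other schedules, such as linear or exponential-in-episode decay, would serve the same purpose.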

	I took the action 'down'. This is not a good action since it leads me to the poison.

	## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]

	Because I ate poison, I get \\(R_{t+1} = -10\\), and I die.

	## Step 4: Update Q(St, At) [[step4-4]]

	Because we're dead, we start a new episode. But what we see here is that, with just two exploration steps, my agent became smarter.
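As a worked check of this second update, assume the Q-table started at 0 and that a terminal state contributes no future value (both are assumptions, since the numbers are not shown here). With a learning rate of 0.1 and gamma of 0.99, the standard Q-Learning update gives:

\\(Q(S_t, A_t) \leftarrow 0 + 0.1 \, [-10 + 0.99 \cdot 0 - 0] = -1\\)

So going down from the small-cheese state is now valued at -1, while the first update, under the same assumptions, set the value of going right from the start to 0.1. The table already separates a slightly good action from a bad one.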

	As we continue exploring and exploiting the environment and updating Q-values using the TD target, the Q-table will give us a better and better approximation. At the end of the training, we'll get an estimate of the optimal Q-function.
