| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"The Reinforcement Learning Framework","local":"the-reinforcement-learning-framework","sections":[{"title":"The RL Process","local":"the-rl-process","sections":[],"depth":2},{"title":"The reward hypothesis: the central idea of Reinforcement Learning","local":"reward-hypothesis","sections":[],"depth":2},{"title":"Markov Property","local":"markov-property","sections":[],"depth":2},{"title":"Observations/States Space","local":"obs-space","sections":[],"depth":2},{"title":"Action Space","local":"action-space","sections":[],"depth":2},{"title":"Rewards and the discounting","local":"rewards","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/deep-rl-course/pr_676/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/entry/start.9e28dccc.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/scheduler.cc52f4b9.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/singletons.5af02168.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/index.5033808a.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/paths.e6c20d65.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/entry/app.09b9f016.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/preload-helper.ead8a4db.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/index.caf36c24.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/nodes/0.13b90098.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/nodes/16.b8986075.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/Tip.60c4c4c1.js"> | |
| <link rel="modulepreload" href="/docs/deep-rl-course/pr_676/en/_app/immutable/chunks/MermaidChart.svelte_svelte_type_style_lang.3fce6c88.js"> <p></p> <h1 class="relative group"><a id="the-reinforcement-learning-framework" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#the-reinforcement-learning-framework"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The Reinforcement Learning Framework</span></h1> <h2 class="relative group"><a 
id="the-rl-process" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#the-rl-process"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The RL Process</span></h2> <figure data-svelte-h="svelte-ddctb4"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%"> <figcaption>The RL Process: a loop of state, action, reward and next state</figcaption> <figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. 
Barto</a></figcaption></figure> <p data-svelte-h="svelte-xrenqd">To understand the RL process, let’s imagine an agent learning to play a platform game:</p> <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%"> <ul><li>Our Agent receives <strong>state <!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>S</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">S_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><!-- HTML_TAG_END --></strong> from the <strong data-svelte-h="svelte-17f3im9">Environment</strong> — we receive the first frame of our game (Environment).</li> <li>Based on that <strong>state<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>S</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">S_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" 
style="margin-right:0.05764em;">S</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><!-- HTML_TAG_END -->,</strong> the Agent takes <strong>action<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>A</mi><mn>0</mn></msub></mrow><annotation encoding="application/x-tex">A_0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">A</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><!-- HTML_TAG_END --></strong> — our Agent will move to the right.</li> <li>The environment goes to a <strong data-svelte-h="svelte-xvdw54">new</strong> <strong>state<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>S</mi><mn>1</mn></msub></mrow><annotation 
encoding="application/x-tex">S_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><!-- HTML_TAG_END --></strong> — new frame.</li> <li>The environment gives some <strong>reward<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">R_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0077em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><!-- HTML_TAG_END --></strong> to the Agent — we’re not dead <em 
data-svelte-h="svelte-wepaxr">(Positive Reward +1)</em>.</li></ul> <p data-svelte-h="svelte-1vfvq77">This RL loop outputs a sequence of <strong>state, action, reward and next state.</strong></p> <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%"> <p data-svelte-h="svelte-88xzqs">The agent’s goal is to <em>maximize</em> its cumulative reward, <strong>called the expected return.</strong></p> <h2 class="relative group"><a id="reward-hypothesis" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#reward-hypothesis"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The reward hypothesis: the central idea of Reinforcement Learning</span></h2> <p data-svelte-h="svelte-e5nccy">⇒ Why is the goal of the agent to maximize the expected return?</p> <p data-svelte-h="svelte-1c7xiaj">Because RL is based on the <strong>reward hypothesis</strong>, which is that all goals can be described as the <strong>maximization of the expected return</strong> (expected cumulative reward).</p> <p data-svelte-h="svelte-114zp48">That’s why in 
Reinforcement Learning, <strong>to have the best behavior,</strong> we aim to learn to take actions that <strong>maximize the expected cumulative reward.</strong></p> <h2 class="relative group"><a id="markov-property" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#markov-property"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Markov Property</span></h2> <p data-svelte-h="svelte-1fhuxz6">In papers, you’ll see that the RL process is called a <strong>Markov Decision Process</strong> (MDP).</p> <p data-svelte-h="svelte-d51hcb">We’ll talk again about the Markov Property in the following units. 
But if you need to remember one thing about it today, it’s this: the Markov Property implies that our agent needs <strong>only the current state to decide</strong> what action to take and <strong>not the history of all the states and actions</strong> it took before.</p> <h2 class="relative group"><a id="obs-space" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#obs-space"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Observations/States Space</span></h2> <p data-svelte-h="svelte-17vvuzr">Observations/States are the <strong>information our agent gets from the environment.</strong> In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.</p> <p data-svelte-h="svelte-1zsvn7">There is a distinction to make between <em>observation</em> and <em>state</em>, however:</p> <ul data-svelte-h="svelte-mbzk6p"><li><em>State s</em>: <strong>a complete description of the state of the world</strong> (there is no hidden information). 
This is what the agent receives in a fully observed environment.</li></ul> <figure data-svelte-h="svelte-6304o8"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/chess.jpg" alt="Chess"> <figcaption>In a chess game, we receive a state from the environment since we have access to the whole chessboard.</figcaption></figure> <p data-svelte-h="svelte-izrnku">In a chess game, we have access to the whole board, so we receive a state from the environment. In other words, the environment is fully observed.</p> <ul data-svelte-h="svelte-wf7jww"><li><em>Observation o</em>: a <strong>partial description of the state.</strong> This is what the agent receives in a partially observed environment.</li></ul> <figure data-svelte-h="svelte-111fk6r"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"> <figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption></figure> <p data-svelte-h="svelte-16dua6v">In Super Mario Bros, we are in a partially observed environment. 
We receive an observation <strong>since we only see a part of the level.</strong></p> <blockquote class="tip">In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.</blockquote> <p data-svelte-h="svelte-mz2ztc">To recap:</p> <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/obs_space_recap.jpg" alt="Obs space recap" width="100%"> <h2 class="relative group"><a id="action-space" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#action-space"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Action Space</span></h2> <p data-svelte-h="svelte-9ndth8">The Action space is the set of <strong>all possible actions in an environment.</strong></p> <p data-svelte-h="svelte-w9ziai">The actions can come from a <em>discrete</em> or <em>continuous space</em>:</p> <ul data-svelte-h="svelte-3zgz4e"><li><em>Discrete space</em>: the number of possible actions is <strong>finite</strong>.</li></ul> <figure data-svelte-h="svelte-s9r6vk"><img 
src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"> <figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption></figure> <p data-svelte-h="svelte-1nk0x8q">Again, in Super Mario Bros, we have a finite set of actions since there are only 4 possible directions.</p> <ul data-svelte-h="svelte-2j77sd"><li><em>Continuous space</em>: the number of possible actions is <strong>infinite</strong>.</li></ul> <figure data-svelte-h="svelte-mwlysu"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/self_driving_car.jpg" alt="Self Driving Car"> <figcaption>A self-driving car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…</figcaption></figure> <p data-svelte-h="svelte-mz2ztc">To recap:</p> <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/action_space.jpg" alt="Action space recap" width="100%"> <p data-svelte-h="svelte-4py4wk">Knowing whether the action space is discrete or continuous is crucial because it will <strong>matter when choosing an RL algorithm later on.</strong></p> <h2 class="relative group"><a id="rewards" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#rewards"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 
79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Rewards and the discounting</span></h2> <p data-svelte-h="svelte-6x9pd5">The reward is fundamental in RL because it’s <strong>the only feedback</strong> for the agent. Thanks to it, our agent knows <strong>if the action taken was good or not.</strong></p> <p data-svelte-h="svelte-ek5gw1">The cumulative reward at each time step <strong>t</strong> can be written as:</p> <figure data-svelte-h="svelte-jouag6"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards"> <figcaption>The cumulative reward equals the sum of all rewards in the sequence.</figcaption></figure> <p data-svelte-h="svelte-37jxso">Which is equivalent to:</p> <figure data-svelte-h="svelte-yujx4f"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" alt="Rewards"> <figcaption>The cumulative reward = r<sub>t+1</sub> (k = 0: r<sub>t+k+1</sub> = r<sub>t+0+1</sub> = r<sub>t+1</sub>) + r<sub>t+2</sub> (k = 1: r<sub>t+k+1</sub> = r<sub>t+1+1</sub> = r<sub>t+2</sub>) + ...</figcaption></figure> <p data-svelte-h="svelte-1eki0yc">However, in reality, <strong>we can’t just add them like that.</strong> The rewards that come sooner (at the beginning of the game) <strong>are more likely to happen</strong> since they are more predictable than long-term future rewards.</p> <p data-svelte-h="svelte-1isrdpr">Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). 
The mouse’s goal is <strong>to eat the maximum amount of cheese before being eaten by the cat.</strong></p> <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%"> <p data-svelte-h="svelte-aqu0m5">As we can see in the diagram, <strong>we are more likely to eat the cheese near us than the cheese close to the cat</strong> (the closer we are to the cat, the more dangerous it is).</p> <p data-svelte-h="svelte-1fqpec0">Consequently, <strong>the reward near the cat, even if it is bigger (more cheese), will be more discounted</strong> since we’re not really sure we’ll be able to eat it.</p> <p data-svelte-h="svelte-1x17r6">To discount the rewards, we proceed like this:</p> <ol data-svelte-h="svelte-2aj80g"><li>We define a discount rate called gamma. <strong>It must be between 0 and 1.</strong> Most of the time, it is set between <strong>0.95 and 0.99</strong>.</li></ol> <ul data-svelte-h="svelte-11w6wz8"><li>The larger the gamma, the smaller the discount. This means our agent <strong>cares more about the long-term reward.</strong></li> <li>On the other hand, the smaller the gamma, the bigger the discount. This means our <strong>agent cares more about the short-term reward (the nearest cheese).</strong></li></ul> <p data-svelte-h="svelte-k7v726">2. Then, each reward is multiplied by gamma raised to the power of the time step. 
As the time step increases, the cat gets closer to us, <strong>so the future reward is less and less likely to happen.</strong></p> <p data-svelte-h="svelte-qheyag">Our discounted expected cumulative reward is:</p> <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%"> <p></p> | |
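<p>The discounting steps described in this section can be sketched in a few lines of Python. This is only an illustration, assuming the rewards of one episode are already collected in a list; the helper name <code>discounted_return</code> and the sample reward values are hypothetical and not part of the course material. Each reward r<sub>t+k+1</sub> is weighted by gamma raised to the power k before summing:</p>

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return the sum of gamma**k * rewards[k], where rewards[k] stands for r_{t+k+1}."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    total = 0.0
    for k, reward in enumerate(rewards):
        # The reward received k steps ahead is scaled by gamma to the power k.
        total += (gamma ** k) * reward
    return total


# Hypothetical episode: three small pieces of cheese, then a bigger one near the cat.
rewards = [1.0, 1.0, 1.0, 4.0]

print(discounted_return(rewards, gamma=1.0))  # no discounting: 7.0
print(discounted_return(rewards, gamma=0.5))  # heavy discounting: 2.25
```

<p>The values 1.0 and 0.5 are chosen here only because they make the arithmetic easy to check by hand. With a gamma of 0.5, the large reward near the cat contributes only 4 × 0.125 = 0.5 to the total, which shows how a small gamma makes the agent favor the nearest cheese.</p>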
| <script> | |
| { | |
| __sveltekit_1jdh3wb = { | |
| assets: "/docs/deep-rl-course/pr_676/en", | |
| base: "/docs/deep-rl-course/pr_676/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/deep-rl-course/pr_676/en/_app/immutable/entry/start.9e28dccc.js"), | |
| import("/docs/deep-rl-course/pr_676/en/_app/immutable/entry/app.09b9f016.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 16], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |