| {% extends "layout.html" %}
|
|
|
| {% block content %}
|
| <!DOCTYPE html>
|
| <html lang="en">
|
| <head>
|
| <meta charset="UTF-8">
|
| <meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| <title>Study Guide: RL Action & Policy</title>
|
|
|
|
| <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
|
| <style>
|
|
|
| body {
|
| background-color: #ffffff;
|
| color: #000000;
|
| font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
|
| font-weight: normal;
|
| line-height: 1.8;
|
| margin: 0;
|
| padding: 20px;
|
| }
|
|
|
|
|
| .container {
|
| max-width: 800px;
|
| margin: 0 auto;
|
| padding: 20px;
|
| }
|
|
|
|
|
| h1, h2, h3 {
|
| color: #000000;
|
| border: none;
|
| font-weight: bold;
|
| }
|
|
|
| h1 {
|
| text-align: center;
|
| border-bottom: 3px solid #000;
|
| padding-bottom: 10px;
|
| margin-bottom: 30px;
|
| font-size: 2.5em;
|
| }
|
|
|
| h2 {
|
| font-size: 1.8em;
|
| margin-top: 40px;
|
| border-bottom: 1px solid #ddd;
|
| padding-bottom: 8px;
|
| }
|
|
|
| h3 {
|
| font-size: 1.3em;
|
| margin-top: 25px;
|
| }
|
|
|
|
|
| strong {
|
| font-weight: 900;
|
| }
|
|
|
|
|
| p, li {
|
| font-size: 1.1em;
|
| border-bottom: 1px solid #e0e0e0;
|
| padding-bottom: 10px;
|
| margin-bottom: 10px;
|
| }
|
|
|
|
|
| li:last-child {
|
| border-bottom: none;
|
| }
|
|
|
|
|
| ol {
|
| list-style-type: decimal;
|
| padding-left: 20px;
|
| }
|
|
|
| ol li {
|
| padding-left: 10px;
|
| }
|
|
|
|
|
| ul {
|
| list-style-type: none;
|
| padding-left: 0;
|
| }
|
|
|
| ul li::before {
|
| content: "•";
|
| color: #000;
|
| font-weight: bold;
|
| display: inline-block;
|
| width: 1em;
|
| margin-left: 0;
|
| }
|
|
|
|
|
| pre {
|
| background-color: #f4f4f4;
|
| border: 1px solid #ddd;
|
| border-radius: 5px;
|
| padding: 15px;
|
| white-space: pre-wrap;
|
| word-wrap: break-word;
|
| font-family: "Courier New", Courier, monospace;
|
| font-size: 0.95em;
|
| font-weight: normal;
|
| color: #333;
|
| border-bottom: none;
|
| }
|
|
|
|
|
| .story-rl {
|
| background-color: #fef2f2;
|
| border-left: 4px solid #dc3545;
|
| margin: 15px 0;
|
| padding: 10px 15px;
|
| font-style: italic;
|
| color: #555;
|
| font-weight: normal;
|
| border-bottom: none;
|
| }
|
|
|
| .story-rl p, .story-rl li {
|
| border-bottom: none;
|
| }
|
|
|
| .example-rl {
|
| background-color: #fef7f7;
|
| padding: 15px;
|
| margin: 15px 0;
|
| border-radius: 5px;
|
| border-left: 4px solid #f17c87;
|
| }
|
|
|
| .example-rl p, .example-rl li {
|
| border-bottom: none !important;
|
| }
|
|
|
|
|
| .quiz-section {
|
| background-color: #fafafa;
|
| border: 1px solid #ddd;
|
| border-radius: 5px;
|
| padding: 20px;
|
| margin-top: 30px;
|
| }
|
| .quiz-answers {
|
| background-color: #fef7f7;
|
| padding: 15px;
|
| margin-top: 15px;
|
| border-radius: 5px;
|
| }
|
|
|
|
|
| table {
|
| width: 100%;
|
| border-collapse: collapse;
|
| margin: 25px 0;
|
| }
|
| th, td {
|
| border: 1px solid #ddd;
|
| padding: 12px;
|
| text-align: left;
|
| }
|
| th {
|
| background-color: #f2f2f2;
|
| font-weight: bold;
|
| }
|
|
|
|
|
| @media (max-width: 768px) {
|
| body, .container {
|
| padding: 10px;
|
| }
|
| h1 { font-size: 2em; }
|
| h2 { font-size: 1.5em; }
|
| h3 { font-size: 1.2em; }
|
| p, li { font-size: 1em; }
|
| pre { font-size: 0.85em; }
|
| table, th, td { font-size: 0.9em; }
|
| }
|
| </style>
|
| </head>
|
| <body>
|
|
|
| <div class="container">
|
| <h1>🧠 Study Guide: Action & Policy in Reinforcement Learning</h1>
|
|
|
| <h2>🔹 1. Introduction</h2>
|
| <div class="story-rl">
|
| <p><strong>Story-style intuition: The Video Game Character</strong></p>
|
| <p>Think of a character in a video game. At any moment, the character has a set of possible moves they can make—jump, run, duck, attack. These are the character's <strong>Actions</strong>. The player controlling the character has a strategy in their head: "If a monster is close, I should attack. If there's a pit, I should jump." This strategy, this set of rules that dictates which action to take in any situation, is the <strong>Policy</strong>. In Reinforcement Learning, our goal is to teach the agent (the character) to learn the best possible policy on its own to win the game (maximize rewards).</p>
|
| </div>
|
| <p>In the world of RL, the <strong>Action</strong> is the "what" (what the agent does) and the <strong>Policy</strong> is the "how" (how the agent decides what to do). Together, they form the core of the agent's behavior.</p>
|
|
|
| <h2>🔹 2. Action (A)</h2>
|
| <p>An <strong>Action</strong> is one of the possible moves an agent can make in a given state. The set of all possible actions in a state is called the <strong>action space</strong>.</p>
|
| <h3>Types of Action Spaces:</h3>
|
| <ul>
|
<li><strong>Discrete Actions:</strong> The agent chooses from a finite set of distinct actions.
|
| <div class="example-rl"><p><strong>Example:</strong> In a maze, the actions are {Up, Down, Left, Right}. In a game of tic-tac-toe, the actions are placing your mark in one of the empty squares.</p></div>
|
| </li>
|
| <li><strong>Continuous Actions:</strong> The actions are described by real-valued numbers within a certain range.
|
| <div class="example-rl"><p><strong>Example:</strong> For a self-driving car, the action of steering can be any angle between -45.0 and +45.0 degrees. For a thermostat, the action is setting a temperature, which can be any value like 20.5°C.</p></div>
|
| </li>
|
| </ul>
|
| <p>The set of available actions can depend on the current state, denoted as \( A(s) \).</p>
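<p>As a minimal sketch, the two kinds of action space can be represented directly in code. Everything here (the action names, the steering bounds) is illustrative, not taken from any particular RL library:</p>

```python
import random

# Discrete action space: a finite set of distinct actions, as in the maze.
maze_actions = ["Up", "Down", "Left", "Right"]

def sample_discrete(actions):
    """Pick one of the finitely many actions uniformly at random."""
    return random.choice(actions)

# Continuous action space: any real value within a range, as in steering.
STEER_MIN, STEER_MAX = -45.0, 45.0  # degrees (illustrative bounds)

def sample_continuous(low, high):
    """Pick a real-valued action uniformly from [low, high]."""
    return random.uniform(low, high)

a_discrete = sample_discrete(maze_actions)          # e.g. "Left"
a_continuous = sample_continuous(STEER_MIN, STEER_MAX)
```

<p>Libraries such as Gymnasium formalize this same split with `Discrete` and `Box` space objects, but the idea is exactly what the two sampling functions show: choose from a finite set, or choose a real number in a range.</p>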
|
|
|
| <h2>🔹 3. Policy (π)</h2>
|
| <p>A <strong>Policy</strong> is the agent's strategy or "brain." It is a rule that maps a state to an action. The ultimate goal of RL is to find an <strong>optimal policy</strong>—a policy that maximizes the total expected reward over time.</p>
|
| <p>Mathematically, a policy is a distribution over actions given a state: \( \pi(a|s) = P(A_t = a \mid S_t = s) \)</p>
|
| <h3>Types of Policies:</h3>
|
| <ul>
|
| <li><strong>Deterministic Policy:</strong> The policy always outputs the same action for a given state. There is no randomness.
|
| <div class="example-rl"><p><strong>Story Example:</strong> A self-driving car's policy is deterministic: "If the traffic light state is 'Red', the action is always 'Brake'." There is no chance it will do something else.</p></div>
|
| <p>Formula: \( a = \pi(s) \)</p>
|
| </li>
|
| <li><strong>Stochastic Policy:</strong> The policy outputs a probability distribution over actions for a given state. The agent then samples from this distribution to choose its next action.
|
| <div class="example-rl"><p><strong>Story Example:</strong> A poker-playing bot might have a stochastic policy. In a certain state, its policy might be: "70% chance of 'Raising', 30% chance of 'Folding'." This randomness makes the agent's behavior less predictable to opponents and is crucial for exploration.</p></div>
|
| <p>Formula: \( a \sim \pi(\cdot|s) \)</p>
|
| </li>
|
| </ul>
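<p>The two formulas above map to two different kinds of function. Here is a minimal sketch (states, actions, and probabilities are illustrative, echoing the traffic-light and poker examples):</p>

```python
import random

# Deterministic policy: a fixed state -> action mapping, a = pi(s).
deterministic_policy = {"red_light": "Brake", "green_light": "Accelerate"}

def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions, a ~ pi(.|s).
stochastic_policy = {"strong_hand": {"Raise": 0.7, "Fold": 0.3}}

def act_stochastic(state):
    """Samples an action from the state's distribution over actions."""
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```

<p>Calling <code>act_deterministic("red_light")</code> always yields "Brake", while repeated calls to <code>act_stochastic("strong_hand")</code> yield "Raise" roughly 70% of the time.</p>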
|
|
|
| <h2>🔹 4. Policy vs. Value Function</h2>
|
| <p>It's crucial to distinguish between a policy and a value function, as they work together to guide the agent.</p>
|
|
|
| <ul>
|
| <li><strong>Policy (The "How-To" Guide):</strong> The policy tells you <strong>what to do</strong> in a state.
|
| <div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The policy says: Turn Left."</p></div>
|
| </li>
|
| <li><strong>Value Function (The "Evaluation Map"):</strong> The value function tells you <strong>how good it is</strong> to be in a certain state or to take a certain action in a state.
|
| <div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The value function tells you: The path to the left has a high value because it leads to treasure. The path to the right has a low value because it leads to a dragon."</p></div>
|
| </li>
|
| </ul>
|
| <p>Modern RL algorithms often learn both. They use the value function to evaluate how good their actions are, which in turn helps them improve their policy.</p>
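<p>The link between the two can be sketched in a few lines: given action values (a hypothetical "evaluation map" Q), a policy can be derived by simply acting greedily on those values. The numbers below are made up to match the crossroads story:</p>

```python
# Hypothetical action values: how good each action is at the crossroads.
Q = {("crossroads", "Left"): 10.0,    # leads to treasure
     ("crossroads", "Right"): -50.0}  # leads to a dragon

def greedy_policy(state, actions):
    """Derive a policy from the value function: take the highest-value action."""
    return max(actions, key=lambda a: Q[(state, a)])

choice = greedy_policy("crossroads", ["Left", "Right"])  # "Left"
```

<p>This is exactly the sense in which the value function guides the policy: the policy says "Turn Left" because the evaluation map says the left path is worth more.</p>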
|
|
|
| <h2>🔹 5. Interaction Flow with Action & Policy</h2>
|
| <p>The Action and Policy are at the heart of the agent's decision-making in the RL loop.</p>
|
| <ol>
|
| <li><strong>Agent observes state (s):</strong> "I am at a crossroad."</li>
|
| <li><strong>Agent follows its policy (π) to choose an action (a):</strong> "My policy tells me to go left."</li>
|
| <li><strong>Environment transitions and gives reward (r):</strong> The agent moves left, finds a gold coin (+10 reward), and arrives at a new state.</li>
|
| <li><strong>Agent improves its policy:</strong> The agent thinks, "That was a great outcome! My policy was right to tell me to go left from that crossroad. I should strengthen that rule."</li>
|
| </ol>
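<p>The four steps above can be sketched as a toy loop in which positive rewards strengthen the rule that produced them. The softmax policy, the learning rate, and the toy environment are all illustrative assumptions, not a specific algorithm from the literature:</p>

```python
import math
import random

# Preferences over actions at the crossroad; higher = chosen more often.
prefs = {"Left": 0.0, "Right": 0.0}
ALPHA = 0.5  # learning rate (illustrative)

def policy(prefs):
    """A simple stochastic policy: softmax over action preferences."""
    exps = {a: math.exp(p) for a, p in prefs.items()}
    total = sum(exps.values())
    actions, probs = zip(*((a, e / total) for a, e in exps.items()))
    return random.choices(actions, weights=probs, k=1)[0]

def environment(action):
    """Toy environment: going Left finds the gold coin (+10 reward)."""
    return 10.0 if action == "Left" else 0.0

for episode in range(50):
    action = policy(prefs)          # 1-2: observe state, follow policy
    reward = environment(action)    # 3: environment returns a reward
    prefs[action] += ALPHA * reward # 4: positive reward strengthens the rule
```

<p>After a few episodes the preference for "Left" grows, so the policy picks it more and more often: the agent has "strengthened the rule" exactly as described in step 4.</p>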
|
|
|
| <h2>🔹 6. Detailed Examples</h2>
|
| <div class="example-rl">
|
| <h3>Example 1: Chess</h3>
|
| <ul>
|
| <li><strong>Actions:</strong> The set of all legal moves for the current player's pieces (e.g., move pawn e2 to e4, move knight g1 to f3). The action space changes with every state.</li>
|
<li><strong>Policy:</strong> A very complex strategy. A simple policy might be a set of human-written rules: "If my king is in check, my first priority is to move out of check." An advanced policy (like AlphaZero's) is a deep neural network that takes the board state as input and outputs a probability for every possible move.</li>
|
| </ul>
|
| </div>
|
| <div class="example-rl">
|
| <h3>Example 2: Self-Driving Car</h3>
|
| <ul>
|
| <li><strong>Actions:</strong> A continuous action space, often represented as a vector: `[steering_angle, acceleration, braking]`. For example, `[-5.2, 0.8, 0.0]` means steer 5.2 degrees left, accelerate at 80%, and don't brake.</li>
|
| <li><strong>Policy:</strong> A highly sophisticated function that takes sensor data (camera, LiDAR) as input and outputs the continuous action vector. A simple part of the policy might be: "If the distance to the car in front is less than 10 meters and decreasing, the braking component of my action vector should be high."</li>
|
| </ul>
|
| </div>
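<p>For the car example, one practical detail is that a policy's raw output must be clamped into each component's legal range before it is sent to the actuators. A minimal sketch, with bounds that are purely illustrative:</p>

```python
# Illustrative bounds for each component of the continuous action vector.
BOUNDS = {"steering_angle": (-45.0, 45.0),  # degrees
          "acceleration":   (0.0, 1.0),     # fraction of full throttle
          "braking":        (0.0, 1.0)}     # fraction of full braking

def clip_action(action):
    """Clamp each component of a raw policy output into its legal range."""
    return {name: min(max(action[name], lo), hi)
            for name, (lo, hi) in BOUNDS.items()}

raw = {"steering_angle": -60.0, "acceleration": 0.8, "braking": -0.1}
safe = clip_action(raw)  # steering clamped to -45.0, braking to 0.0
```
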
|
|
|
| <h2>🔹 7. Challenges</h2>
|
| <ul>
|
| <li><strong>Huge Action Spaces:</strong>
|
| <div class="example-rl"><p><strong>Example:</strong> In a real-time strategy game like StarCraft, an action could be commanding any one of hundreds of units to do any one of a dozen things, leading to millions of possible actions at any moment.</p></div>
|
| </li>
|
| <li><strong>Designing Effective Policies (Exploration):</strong> How do you design a policy that not only exploits what it knows but also explores new actions to discover better strategies? This is the exploration-exploitation dilemma.</li>
|
| <li><strong>Learning Stable Policies:</strong> In complex, dynamic environments, the feedback from actions can be noisy and delayed, making it very difficult for the policy to learn stable and reliable behaviors.</li>
|
| </ul>
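<p>One classic answer to the exploration-exploitation dilemma is an epsilon-greedy policy: exploit the best-known action most of the time, but explore a random action a small fraction of the time. A minimal sketch with made-up Q-values:</p>

```python
import random

EPSILON = 0.1  # fraction of decisions spent exploring (illustrative)

# Current action-value estimates for one state (illustrative numbers).
Q = {"Left": 2.0, "Right": 0.5}

def epsilon_greedy(q_values, epsilon=EPSILON):
    """Explore with probability epsilon; otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: any action
    return max(q_values, key=q_values.get)     # exploit: best-known action

picks = [epsilon_greedy(Q) for _ in range(1000)]
```

<p>With epsilon at 0.1 the agent picks "Left" about 95% of the time, yet it never stops occasionally testing "Right" in case its estimate for that action is wrong.</p>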
|
|
|
| <div class="quiz-section">
|
| <h2>📝 Quick Quiz: Test Your Knowledge</h2>
|
| <ol>
|
| <li><strong>What is the difference between a discrete and a continuous action space? Give an example of each.</strong></li>
|
| <li><strong>What is the difference between a deterministic and a stochastic policy? When might a stochastic policy be useful?</strong></li>
|
| <li><strong>Can an agent have a good policy without knowing the value function?</strong></li>
|
| </ol>
|
| <div class="quiz-answers">
|
| <h3>Answers</h3>
|
| <p><strong>1.</strong> A <strong>discrete</strong> action space has a finite number of distinct options (e.g., move left/right). A <strong>continuous</strong> action space has actions represented by real numbers in a range (e.g., turning a steering wheel by 15.7 degrees).</p>
|
| <p><strong>2.</strong> A <strong>deterministic</strong> policy always chooses the same action for a state. A <strong>stochastic</strong> policy outputs a probability distribution over actions. A stochastic policy is very useful for <strong>exploration</strong> (trying new things) and for games where unpredictability is an advantage (like poker).</p>
|
| <p><strong>3.</strong> Yes, but it's harder. Some algorithms, called "policy-gradient" methods, can directly search for a good policy without learning a value function. However, many of the most successful modern algorithms learn both, using the value function to help guide improvements to the policy.</p>
|
| </div>
|
| </div>
|
|
|
| </div>
|
|
|
| </body>
|
| </html>
|
| {% endblock %}
|
|
|