{% extends "layout.html" %}
{% block content %}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Study Guide: RL Action & Policy</title>
<!-- MathJax for rendering mathematical formulas -->
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
/* General Body Styles */
body {
background-color: #ffffff; /* White background */
color: #000000; /* Black text */
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
font-weight: normal;
line-height: 1.8;
margin: 0;
padding: 20px;
}
/* Container for centering content */
.container {
max-width: 800px;
margin: 0 auto;
padding: 20px;
}
/* Headings */
h1, h2, h3 {
color: #000000;
border: none;
font-weight: bold;
}
h1 {
text-align: center;
border-bottom: 3px solid #000;
padding-bottom: 10px;
margin-bottom: 30px;
font-size: 2.5em;
}
h2 {
font-size: 1.8em;
margin-top: 40px;
border-bottom: 1px solid #ddd;
padding-bottom: 8px;
}
h3 {
font-size: 1.3em;
margin-top: 25px;
}
/* Main words are even bolder */
strong {
font-weight: 900;
}
/* Paragraphs and List Items with a line below */
p, li {
font-size: 1.1em;
border-bottom: 1px solid #e0e0e0; /* Light gray line below each item */
padding-bottom: 10px; /* Space between text and the line */
margin-bottom: 10px; /* Space below the line */
}
/* Remove bottom border from the last item in a list for cleaner look */
li:last-child {
border-bottom: none;
}
/* Ordered lists */
ol {
list-style-type: decimal;
padding-left: 20px;
}
ol li {
padding-left: 10px;
}
/* Unordered Lists */
ul {
list-style-type: none;
padding-left: 0;
}
ul li::before {
content: "•";
color: #000;
font-weight: bold;
display: inline-block;
width: 1em;
margin-left: 0;
}
/* Code block styling */
pre {
background-color: #f4f4f4;
border: 1px solid #ddd;
border-radius: 5px;
padding: 15px;
white-space: pre-wrap;
word-wrap: break-word;
font-family: "Courier New", Courier, monospace;
font-size: 0.95em;
font-weight: normal;
color: #333;
border-bottom: none;
}
/* RL Specific Styling */
.story-rl {
background-color: #fef2f2;
border-left: 4px solid #dc3545; /* Red accent */
margin: 15px 0;
padding: 10px 15px;
font-style: italic;
color: #555;
font-weight: normal;
border-bottom: none;
}
.story-rl p, .story-rl li {
border-bottom: none;
}
.example-rl {
background-color: #fef7f7;
padding: 15px;
margin: 15px 0;
border-radius: 5px;
border-left: 4px solid #f17c87; /* Lighter Red accent */
}
.example-rl p, .example-rl li {
border-bottom: none !important;
}
/* Quiz Styling */
.quiz-section {
background-color: #fafafa;
border: 1px solid #ddd;
border-radius: 5px;
padding: 20px;
margin-top: 30px;
}
.quiz-answers {
background-color: #fef7f7;
padding: 15px;
margin-top: 15px;
border-radius: 5px;
}
/* Table Styling */
table {
width: 100%;
border-collapse: collapse;
margin: 25px 0;
}
th, td {
border: 1px solid #ddd;
padding: 12px;
text-align: left;
}
th {
background-color: #f2f2f2;
font-weight: bold;
}
/* --- Mobile Responsive Styles --- */
@media (max-width: 768px) {
body, .container {
padding: 10px;
}
h1 { font-size: 2em; }
h2 { font-size: 1.5em; }
h3 { font-size: 1.2em; }
p, li { font-size: 1em; }
pre { font-size: 0.85em; }
table, th, td { font-size: 0.9em; }
}
</style>
</head>
<body>
<div class="container">
<h1>🧠 Study Guide: Action & Policy in Reinforcement Learning</h1>
<h2>🔹 1. Introduction</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Video Game Character</strong></p>
<p>Think of a character in a video game. At any moment, the character has a set of possible moves they can make—jump, run, duck, attack. These are the character's <strong>Actions</strong>. The player controlling the character has a strategy in their head: "If a monster is close, I should attack. If there's a pit, I should jump." This strategy, this set of rules that dictates which action to take in any situation, is the <strong>Policy</strong>. In Reinforcement Learning, our goal is to teach the agent (the character) to learn the best possible policy on its own to win the game (maximize rewards).</p>
</div>
<p>In the world of RL, the <strong>Action</strong> is the "what" (what the agent does) and the <strong>Policy</strong> is the "how" (how the agent decides what to do). Together, they form the core of the agent's behavior.</p>
<h2>🔹 2. Action (A)</h2>
<p>An <strong>Action</strong> is one of the possible moves an agent can make in a given state. The set of all possible actions in a state is called the <strong>action space</strong>.</p>
<h3>Types of Action Spaces:</h3>
<ul>
<li><strong>Discrete Actions:</strong> There is a finite, limited set of distinct actions the agent can choose from.
<div class="example-rl"><p><strong>Example:</strong> In a maze, the actions are {Up, Down, Left, Right}. In a game of tic-tac-toe, the actions are placing your mark in one of the empty squares.</p></div>
</li>
<li><strong>Continuous Actions:</strong> The actions are described by real-valued numbers within a certain range.
<div class="example-rl"><p><strong>Example:</strong> For a self-driving car, the action of steering can be any angle between -45.0 and +45.0 degrees. For a thermostat, the action is setting a temperature, which can be any value like 20.5°C.</p></div>
</li>
</ul>
<p>The set of available actions can depend on the current state, denoted as \( A(s) \).</p>
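<p>The two kinds of action space can be sketched in plain Python. The maze actions and steering range below come from the examples above; the function names themselves are illustrative, not from any RL library:</p>

```python
import random

# Discrete action space: a finite set of distinct choices.
MAZE_ACTIONS = ["Up", "Down", "Left", "Right"]

def sample_discrete():
    """Pick one of the finitely many maze actions at random."""
    return random.choice(MAZE_ACTIONS)

# Continuous action space: any real value within a range.
STEERING_RANGE = (-45.0, 45.0)  # degrees, as in the self-driving example

def sample_continuous():
    """Pick a steering angle anywhere in the allowed interval."""
    low, high = STEERING_RANGE
    return random.uniform(low, high)
```

<p>A discrete sample is always one of four strings, while a continuous sample can be any of infinitely many angles in the interval.</p>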
<h2>🔹 3. Policy (π)</h2>
<p>A <strong>Policy</strong> is the agent's strategy or "brain." It is a rule that maps a state to an action. The ultimate goal of RL is to find an <strong>optimal policy</strong>—a policy that maximizes the total expected reward over time.</p>
<p>Mathematically, a policy is a distribution over actions given a state: \( \pi(a|s) = P(A_t = a \mid S_t = s) \)</p>
<h3>Types of Policies:</h3>
<ul>
<li><strong>Deterministic Policy:</strong> The policy always outputs the same action for a given state. There is no randomness.
<div class="example-rl"><p><strong>Story Example:</strong> A self-driving car can follow a deterministic rule: "If the traffic light state is 'Red', the action is always 'Brake'." There is no chance it will do anything else.</p></div>
<p>Formula: \( a = \pi(s) \)</p>
</li>
<li><strong>Stochastic Policy:</strong> The policy outputs a probability distribution over actions for a given state. The agent then samples from this distribution to choose its next action.
<div class="example-rl"><p><strong>Story Example:</strong> A poker-playing bot might have a stochastic policy. In a certain state, its policy might be: "70% chance of 'Raising', 30% chance of 'Folding'." This randomness makes the agent's behavior less predictable to opponents and is crucial for exploration.</p></div>
<p>Formula: \( a \sim \pi(\cdot|s) \)</p>
</li>
</ul>
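<p>Both policy types above can be sketched in a few lines of Python. The states, actions, and probabilities are the illustrative ones from the examples, not from any real system:</p>

```python
import random

# Deterministic policy: a fixed state -> action mapping, a = pi(s).
deterministic_policy = {
    "red_light": "Brake",
    "green_light": "Accelerate",
}

# Stochastic policy: a state -> probability-distribution mapping, a ~ pi(.|s).
stochastic_policy = {
    "strong_hand": {"Raise": 0.7, "Fold": 0.3},
}

def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]

def act_stochastic(state):
    """Samples an action from the distribution pi(.|state)."""
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]
```

<p>Calling <code>act_deterministic("red_light")</code> always yields "Brake", while repeated calls to <code>act_stochastic("strong_hand")</code> yield "Raise" roughly 70% of the time.</p>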
<h2>🔹 4. Policy vs. Value Function</h2>
<p>It's crucial to distinguish between a policy and a value function, as they work together to guide the agent.</p>
<ul>
<li><strong>Policy (The "How-To" Guide):</strong> The policy tells you <strong>what to do</strong> in a state.
<div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The policy says: Turn Left."</p></div>
</li>
<li><strong>Value Function (The "Evaluation Map"):</strong> The value function tells you <strong>how good it is</strong> to be in a certain state or to take a certain action in a state.
<div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The value function tells you: The path to the left has a high value because it leads to treasure. The path to the right has a low value because it leads to a dragon."</p></div>
</li>
</ul>
<p>Modern RL algorithms often learn both. They use the value function to evaluate how good their actions are, which in turn helps them improve their policy.</p>
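<p>This division of labor can be sketched directly: a greedy policy simply reads the value function and picks the best-scored action. The crossroads values below are made-up numbers for illustration:</p>

```python
# Toy crossroads: the value function scores each (state, action) pair,
# and the policy just picks the highest-valued action ("greedy" policy).
action_values = {
    ("crossroads", "Left"): 10.0,    # leads to treasure
    ("crossroads", "Right"): -50.0,  # leads to a dragon
}

def greedy_policy(state, actions):
    """Derive the 'what to do' (policy) from the 'how good' (values)."""
    return max(actions, key=lambda a: action_values[(state, a)])
```

<p>Here the value function is the "evaluation map" and the policy is nothing more than a rule for reading it.</p>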
<h2>🔹 5. Interaction Flow with Action & Policy</h2>
<p>The Action and Policy are at the heart of the agent's decision-making in the RL loop.</p>
<ol>
<li><strong>Agent observes state (s):</strong> "I am at a crossroad."</li>
<li><strong>Agent follows its policy (π) to choose an action (a):</strong> "My policy tells me to go left."</li>
<li><strong>Environment transitions and gives reward (r):</strong> The agent moves left, finds a gold coin (+10 reward), and arrives at a new state.</li>
<li><strong>Agent improves its policy:</strong> The agent thinks, "That was a great outcome! My policy was right to tell me to go left from that crossroad. I should strengthen that rule."</li>
</ol>
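<p>The four steps above can be sketched as a toy tabular agent. The environment, the +10 reward, and the "strengthen the rule" update are illustrative simplifications, not a specific named algorithm:</p>

```python
# Toy environment: going "Left" at the crossroad yields +10, "Right" yields 0.
def step(state, action):
    reward = 10 if (state == "crossroad" and action == "Left") else 0
    return reward, "next_place"

# Preferences the agent strengthens when an action works out.
preferences = {("crossroad", "Left"): 0.0, ("crossroad", "Right"): 0.0}

def choose_action(state):
    # Step 2: follow the policy implied by the current preferences.
    actions = [a for (s, a) in preferences if s == state]
    return max(actions, key=lambda a: preferences[(state, a)])

def learn(episodes=50, lr=0.1):
    for _ in range(episodes):
        state = "crossroad"                          # 1. observe state
        action = choose_action(state)                # 2. policy picks action
        reward, _ = step(state, action)              # 3. environment responds
        preferences[(state, action)] += lr * reward  # 4. strengthen the rule
```

<p>After a few episodes the preference for "Left" dominates, so the learned policy reliably turns left at the crossroad.</p>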
<h2>🔹 6. Detailed Examples</h2>
<div class="example-rl">
<h3>Example 1: Chess</h3>
<ul>
<li><strong>Actions:</strong> The set of all legal moves for the current player's pieces (e.g., move pawn e2 to e4, move knight g1 to f3). The action space changes with every state.</li>
<li><strong>Policy:</strong> A very complex strategy. A simple policy might be a set of human-written rules: "If my king is in check, my first priority is to move out of check." An advanced policy (like AlphaZero's) is a deep neural network that takes the board state as input and outputs a probability for every possible move.</li>
</ul>
</div>
<div class="example-rl">
<h3>Example 2: Self-Driving Car</h3>
<ul>
<li><strong>Actions:</strong> A continuous action space, often represented as a vector: `[steering_angle, acceleration, braking]`. For example, `[-5.2, 0.8, 0.0]` means steer 5.2 degrees left, accelerate at 80%, and don't brake.</li>
<li><strong>Policy:</strong> A highly sophisticated function that takes sensor data (camera, LiDAR) as input and outputs the continuous action vector. A simple part of the policy might be: "If the distance to the car in front is less than 10 meters and decreasing, the braking component of my action vector should be high."</li>
</ul>
</div>
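<p>One small, concrete piece of such a system is keeping the continuous action vector within legal bounds. The helper below is hypothetical and the bound values are assumptions for illustration, using the <code>[steering_angle, acceleration, braking]</code> layout from the example above:</p>

```python
# Assumed bounds for the [steering_angle, acceleration, braking] vector:
# steering in degrees, acceleration and braking as fractions of maximum.
ACTION_BOUNDS = [(-45.0, 45.0), (0.0, 1.0), (0.0, 1.0)]

def clamp_action(action):
    """Clip each component of a continuous action into its legal range."""
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(action, ACTION_BOUNDS)]
```

<p>An in-range vector like <code>[-5.2, 0.8, 0.0]</code> passes through unchanged; an out-of-range one is clipped to the nearest legal value.</p>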
<h2>🔹 7. Challenges</h2>
<ul>
<li><strong>Huge Action Spaces:</strong>
<div class="example-rl"><p><strong>Example:</strong> In a real-time strategy game like StarCraft, an action could be commanding any one of hundreds of units to do any one of a dozen things, leading to millions of possible actions at any moment.</p></div>
</li>
<li><strong>Designing Effective Policies (Exploration):</strong> How do you design a policy that not only exploits what it knows but also explores new actions to discover better strategies? This is the exploration-exploitation dilemma.</li>
<li><strong>Learning Stable Policies:</strong> In complex, dynamic environments, the feedback from actions can be noisy and delayed, making it very difficult for the policy to learn stable and reliable behaviors.</li>
</ul>
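<p>A standard, simple answer to the exploration-exploitation dilemma is an <strong>epsilon-greedy</strong> policy: with probability ε take a random action (explore), otherwise take the best-known action (exploit). A minimal sketch:</p>

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon, explore with a random action;
    otherwise exploit the action with the highest estimated value."""
    actions = list(action_values)
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=action_values.get)     # exploit
```

<p>Setting ε to 0 recovers a purely greedy (exploiting) policy, while ε of 1 explores uniformly at random; practical values are usually small, and are often decayed over training.</p>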
<div class="quiz-section">
<h2>📝 Quick Quiz: Test Your Knowledge</h2>
<ol>
<li><strong>What is the difference between a discrete and a continuous action space? Give an example of each.</strong></li>
<li><strong>What is the difference between a deterministic and a stochastic policy? When might a stochastic policy be useful?</strong></li>
<li><strong>Can an agent have a good policy without knowing the value function?</strong></li>
</ol>
<div class="quiz-answers">
<h3>Answers</h3>
<p><strong>1.</strong> A <strong>discrete</strong> action space has a finite number of distinct options (e.g., move left/right). A <strong>continuous</strong> action space has actions represented by real numbers in a range (e.g., turning a steering wheel by 15.7 degrees).</p>
<p><strong>2.</strong> A <strong>deterministic</strong> policy always chooses the same action for a state. A <strong>stochastic</strong> policy outputs a probability distribution over actions. A stochastic policy is very useful for <strong>exploration</strong> (trying new things) and for games where unpredictability is an advantage (like poker).</p>
<p><strong>3.</strong> Yes, but it's harder. Some algorithms, called "policy-gradient" methods, can directly search for a good policy without learning a value function. However, many of the most successful modern algorithms learn both, using the value function to help guide improvements to the policy.</p>
</div>
</div>
</div>
</body>
</html>
{% endblock %}