{% extends "layout.html" %}

{% block content %}
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Study Guide: RL Action & Policy</title>
    <!-- MathJax for rendering mathematical formulas -->
    <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
    <style>
        /* General Body Styles */
        body {
            background-color: #ffffff; /* White background */
            color: #000000; /* Black text */
            font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
            font-weight: normal;
            line-height: 1.8;
            margin: 0;
            padding: 20px;
        }

        /* Container for centering content */
        .container {
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
        }

        /* Headings */
        h1, h2, h3 {
            color: #000000;
            border: none;
            font-weight: bold;
        }

        h1 {
            text-align: center;
            border-bottom: 3px solid #000;
            padding-bottom: 10px;
            margin-bottom: 30px;
            font-size: 2.5em;
        }

        h2 {
            font-size: 1.8em;
            margin-top: 40px;
            border-bottom: 1px solid #ddd;
            padding-bottom: 8px;
        }

        h3 {
            font-size: 1.3em;
            margin-top: 25px;
        }

        /* Main words are even bolder */
        strong {
            font-weight: 900;
        }

        /* Paragraphs and List Items with a line below */
        p, li {
            font-size: 1.1em;
            border-bottom: 1px solid #e0e0e0; /* Light gray line below each item */
            padding-bottom: 10px; /* Space between text and the line */
            margin-bottom: 10px; /* Space below the line */
        }

        /* Remove bottom border from the last item in a list for cleaner look */
        li:last-child {
            border-bottom: none;
        }

        /* Ordered lists */
        ol {
            list-style-type: decimal;
            padding-left: 20px;
        }

        ol li {
            padding-left: 10px;
        }

        /* Unordered Lists */
        ul {
            list-style-type: none;
            padding-left: 0;
        }

        ul li::before {
            content: "•";
            color: #000;
            font-weight: bold;
            display: inline-block;
            width: 1em;
            margin-left: 0;
        }

        /* Code block styling */
        pre {
            background-color: #f4f4f4;
            border: 1px solid #ddd;
            border-radius: 5px;
            padding: 15px;
            white-space: pre-wrap;
            word-wrap: break-word;
            font-family: "Courier New", Courier, monospace;
            font-size: 0.95em;
            font-weight: normal;
            color: #333;
            border-bottom: none;
        }

        /* RL Specific Styling */
        .story-rl {
            background-color: #fef2f2;
            border-left: 4px solid #dc3545; /* Red accent */
            margin: 15px 0;
            padding: 10px 15px;
            font-style: italic;
            color: #555;
            font-weight: normal;
            border-bottom: none;
        }

        .story-rl p, .story-rl li {
            border-bottom: none;
        }

        .example-rl {
            background-color: #fef7f7;
            padding: 15px;
            margin: 15px 0;
            border-radius: 5px;
            border-left: 4px solid #f17c87; /* Lighter Red accent */
        }

        .example-rl p, .example-rl li {
            border-bottom: none !important;
        }

        /* Quiz Styling */
        .quiz-section {
            background-color: #fafafa;
            border: 1px solid #ddd;
            border-radius: 5px;
            padding: 20px;
            margin-top: 30px;
        }

        .quiz-answers {
            background-color: #fef7f7;
            padding: 15px;
            margin-top: 15px;
            border-radius: 5px;
        }

        /* Table Styling */
        table {
            width: 100%;
            border-collapse: collapse;
            margin: 25px 0;
        }

        th, td {
            border: 1px solid #ddd;
            padding: 12px;
            text-align: left;
        }

        th {
            background-color: #f2f2f2;
            font-weight: bold;
        }

        /* --- Mobile Responsive Styles --- */
        @media (max-width: 768px) {
            body, .container {
                padding: 10px;
            }
            h1 { font-size: 2em; }
            h2 { font-size: 1.5em; }
            h3 { font-size: 1.2em; }
            p, li { font-size: 1em; }
            pre { font-size: 0.85em; }
            table, th, td { font-size: 0.9em; }
        }
    </style>
</head>
<body>

    <div class="container">
        <h1>🧠 Study Guide: Action & Policy in Reinforcement Learning</h1>

        <h2>🔹 1. Introduction</h2>
        <div class="story-rl">
            <p><strong>Story-style intuition: The Video Game Character</strong></p>
            <p>Think of a character in a video game. At any moment, the character has a set of possible moves they can make—jump, run, duck, attack. These are the character's <strong>Actions</strong>. The player controlling the character has a strategy in their head: "If a monster is close, I should attack. If there's a pit, I should jump." This strategy, this set of rules that dictates which action to take in any situation, is the <strong>Policy</strong>. In Reinforcement Learning, our goal is to teach the agent (the character) to learn the best possible policy on its own to win the game (maximize rewards).</p>
        </div>
        <p>In the world of RL, the <strong>Action</strong> is the "what" (what the agent does) and the <strong>Policy</strong> is the "how" (how the agent decides what to do). Together, they form the core of the agent's behavior.</p>

        <h2>🔹 2. Action (A)</h2>
        <p>An <strong>Action</strong> is one of the possible moves an agent can make in a given state. The set of all possible actions in a state is called the <strong>action space</strong>.</p>
        <h3>Types of Action Spaces:</h3>
        <ul>
            <li><strong>Discrete Actions:</strong> There is a finite set of distinct actions the agent can choose from.
                <div class="example-rl"><p><strong>Example:</strong> In a maze, the actions are {Up, Down, Left, Right}. In a game of tic-tac-toe, the actions are placing your mark in one of the empty squares.</p></div>
            </li>
            <li><strong>Continuous Actions:</strong> The actions are described by real-valued numbers within a certain range.
                <div class="example-rl"><p><strong>Example:</strong> For a self-driving car, the action of steering can be any angle between -45.0 and +45.0 degrees. For a thermostat, the action is setting a temperature, which can be any value like 20.5°C.</p></div>
            </li>
        </ul>
        <p>The set of available actions can depend on the current state, denoted as \( A(s) \).</p>
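        <p>The two kinds of action space, and a state-dependent \( A(s) \), can be sketched in a few lines of Python. The maze layout and steering bounds below are illustrative assumptions, not taken from any particular library:</p>

```python
import random

# Discrete action space: a finite set of named moves (toy maze agent).
MAZE_ACTIONS = ["Up", "Down", "Left", "Right"]

def available_actions(state):
    """State-dependent action set A(s): a hypothetical wall blocks 'Up'."""
    if state == "against_top_wall":
        return [a for a in MAZE_ACTIONS if a != "Up"]
    return MAZE_ACTIONS

# Continuous action space: any real value inside a range (toy steering).
STEER_MIN, STEER_MAX = -45.0, 45.0
steering_angle = random.uniform(STEER_MIN, STEER_MAX)

print(available_actions("against_top_wall"))     # ['Down', 'Left', 'Right']
print(STEER_MIN <= steering_angle <= STEER_MAX)  # True
```

        <p>Note how the discrete space is a list you can enumerate, while the continuous space can only be described by its bounds.</p>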

        <h2>🔹 3. Policy (π)</h2>
        <p>A <strong>Policy</strong> is the agent's strategy or "brain." It is a rule that maps a state to an action. The ultimate goal of RL is to find an <strong>optimal policy</strong>—a policy that maximizes the total expected reward over time.</p>
        <p>Mathematically, a policy is a distribution over actions given a state: \( \pi(a|s) = P(A_t = a \mid S_t = s) \)</p>
        <h3>Types of Policies:</h3>
        <ul>
            <li><strong>Deterministic Policy:</strong> The policy always outputs the same action for a given state. There is no randomness.
                <div class="example-rl"><p><strong>Story Example:</strong> A rule in a self-driving car's policy could be deterministic: "If the traffic light state is 'Red', the action is always 'Brake'." There is no chance the policy will choose anything else.</p></div>
                <p>Formula: \( a = \pi(s) \)</p>
            </li>
            <li><strong>Stochastic Policy:</strong> The policy outputs a probability distribution over actions for a given state. The agent then samples from this distribution to choose its next action.
                <div class="example-rl"><p><strong>Story Example:</strong> A poker-playing bot might have a stochastic policy. In a certain state, its policy might be: "70% chance of 'Raising', 30% chance of 'Folding'." This randomness makes the agent's behavior less predictable to opponents and is crucial for exploration.</p></div>
                <p>Formula: \( a \sim \pi(\cdot|s) \)</p>
            </li>
        </ul>
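        <p>A minimal Python sketch of both policy types; the states, actions, and probabilities are made up for illustration:</p>

```python
import random

# Deterministic policy: a = pi(s), the same action every time.
RULES = {"red_light": "Brake", "green_light": "Accelerate"}

def deterministic_policy(state):
    return RULES[state]

# Stochastic policy: a ~ pi(.|s), an action sampled from a distribution.
POKER_POLICY = {"strong_hand": (["Raise", "Fold"], [0.7, 0.3]),
                "weak_hand":   (["Raise", "Fold"], [0.1, 0.9])}

def stochastic_policy(state):
    actions, probs = POKER_POLICY[state]
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy("red_light"))  # always 'Brake'
print(stochastic_policy("strong_hand"))   # usually 'Raise', sometimes 'Fold'
```

        <p>Calling the deterministic policy twice on the same state always returns the same action; calling the stochastic one twice may not.</p>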
        
        <h2>🔹 4. Policy vs. Value Function</h2>
        <p>It's crucial to distinguish between a policy and a value function, as they work together to guide the agent.</p>
        
        <ul>
            <li><strong>Policy (The "How-To" Guide):</strong> The policy tells you <strong>what to do</strong> in a state.
                <div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The policy says: Turn Left."</p></div>
            </li>
            <li><strong>Value Function (The "Evaluation Map"):</strong> The value function tells you <strong>how good it is</strong> to be in a certain state or to take a certain action in a state.
                <div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The value function tells you: The path to the left has a high value because it leads to treasure. The path to the right has a low value because it leads to a dragon."</p></div>
            </li>
        </ul>
        <p>Modern RL algorithms often learn both. They use the value function to evaluate how good their actions are, which in turn helps them improve their policy.</p>
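        <p>This division of labor can be made concrete: if the agent has learned state values, a simple greedy policy just picks the action leading to the highest-valued state. The values and transition table below are invented for the crossroads example:</p>

```python
# Hypothetical learned values: how good each successor state is.
V = {"left_path": 10.0,    # leads to treasure
     "right_path": -50.0}  # leads to a dragon

# Assumed known transitions: which state each action reaches.
NEXT_STATE = {"Turn Left": "left_path", "Turn Right": "right_path"}

def greedy_policy(actions):
    """Derive a policy from the value function: act toward the best state."""
    return max(actions, key=lambda a: V[NEXT_STATE[a]])

print(greedy_policy(["Turn Left", "Turn Right"]))  # 'Turn Left'
```

        <p>Here the value function does the evaluating and the policy falls out of it, which is exactly how value-based methods turn "how good" into "what to do".</p>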

        <h2>🔹 5. Interaction Flow with Action & Policy</h2>
        <p>The Action and Policy are at the heart of the agent's decision-making in the RL loop.</p>
        <ol>
            <li><strong>Agent observes state (s):</strong> "I am at a crossroads."</li>
            <li><strong>Agent follows its policy (π) to choose an action (a):</strong> "My policy tells me to go left."</li>
            <li><strong>Environment transitions and gives reward (r):</strong> The agent moves left, finds a gold coin (+10 reward), and arrives at a new state.</li>
            <li><strong>Agent improves its policy:</strong> The agent thinks, "That was a great outcome! My policy was right to tell me to go left from that crossroads. I should strengthen that rule."</li>
        </ol>
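        <p>The four steps above can be sketched as a toy loop in Python. The reward table and the simple "strengthen what worked" update are illustrative assumptions, not a specific published algorithm:</p>

```python
import random

random.seed(0)  # reproducible run

# 3. Environment: going Left finds the gold coin (+10), Right finds nothing.
REWARDS = {"Left": 10.0, "Right": 0.0}

# Policy: a preference score per action; actions are sampled in proportion.
prefs = {"Left": 1.0, "Right": 1.0}

def choose_action():  # 2. follow the policy
    actions = list(prefs)
    return random.choices(actions, weights=[prefs[a] for a in actions])[0]

for step in range(100):            # 1. the agent observes the crossroads
    action = choose_action()
    reward = REWARDS[action]
    prefs[action] += 0.1 * reward  # 4. strengthen rules that paid off

print(prefs["Left"] > prefs["Right"])  # True: 'Left' was reinforced
```

        <p>Because rewarded actions gain preference and are then chosen more often, the loop gradually shifts probability mass toward "Left", which is the essence of policy improvement.</p>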

        <h2>🔹 6. Detailed Examples</h2>
        <div class="example-rl">
            <h3>Example 1: Chess</h3>
            <ul>
                <li><strong>Actions:</strong> The set of all legal moves for the current player's pieces (e.g., move pawn e2 to e4, move knight g1 to f3). The action space changes with every state.</li>
                <li><strong>Policy:</strong> A very complex strategy. A simple policy might be a set of human-written rules: "If my king is in check, my first priority is to move out of check." An advanced policy (like AlphaZero's) is a deep neural network that takes the board state as input and outputs a probability for every possible move.</li>
            </ul>
        </div>
        <div class="example-rl">
            <h3>Example 2: Self-Driving Car</h3>
            <ul>
                <li><strong>Actions:</strong> A continuous action space, often represented as a vector: <code>[steering_angle, acceleration, braking]</code>. For example, <code>[-5.2, 0.8, 0.0]</code> means steer 5.2 degrees left, accelerate at 80%, and don't brake.</li>
                <li><strong>Policy:</strong> A highly sophisticated function that takes sensor data (camera, LiDAR) as input and outputs the continuous action vector. A simple part of the policy might be: "If the distance to the car in front is less than 10 meters and decreasing, the braking component of my action vector should be high."</li>
            </ul>
        </div>
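        <p>For the car example, one practical detail is worth showing: a continuous action vector must be kept inside its valid ranges before the environment executes it. A minimal sketch, with bounds assumed purely for illustration:</p>

```python
# Assumed bounds: [steering_angle (deg), acceleration (0-1), braking (0-1)].
BOUNDS = [(-45.0, 45.0), (0.0, 1.0), (0.0, 1.0)]

def clip_action(action):
    """Clamp every component of the action vector into its valid range."""
    return [max(lo, min(hi, a)) for a, (lo, hi) in zip(action, BOUNDS)]

print(clip_action([-60.0, 0.8, 1.5]))  # [-45.0, 0.8, 1.0]
```

        <p>Clipping (or an equivalent bounded output layer) is a common safeguard whenever a policy emits real-valued actions.</p>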
        
        <h2>🔹 7. Challenges</h2>
        <ul>
            <li><strong>Huge Action Spaces:</strong>
                <div class="example-rl"><p><strong>Example:</strong> In a real-time strategy game like StarCraft, an action could be commanding any one of hundreds of units to do any one of a dozen things, leading to millions of possible actions at any moment.</p></div>
            </li>
            <li><strong>Designing Effective Policies (Exploration):</strong> How do you design a policy that not only exploits what it knows but also explores new actions to discover better strategies? This is the exploration-exploitation dilemma.</li>
            <li><strong>Learning Stable Policies:</strong> In complex, dynamic environments, the feedback from actions can be noisy and delayed, making it very difficult for the policy to learn stable and reliable behaviors.</li>
        </ul>
        
        <div class="quiz-section">
            <h2>📝 Quick Quiz: Test Your Knowledge</h2>
            <ol>
                <li><strong>What is the difference between a discrete and a continuous action space? Give an example of each.</strong></li>
                <li><strong>What is the difference between a deterministic and a stochastic policy? When might a stochastic policy be useful?</strong></li>
                <li><strong>Can an agent have a good policy without knowing the value function?</strong></li>
            </ol>
             <div class="quiz-answers">
                <h3>Answers</h3>
                <p><strong>1.</strong> A <strong>discrete</strong> action space has a finite number of distinct options (e.g., move left/right). A <strong>continuous</strong> action space has actions represented by real numbers in a range (e.g., turning a steering wheel by 15.7 degrees).</p>
                <p><strong>2.</strong> A <strong>deterministic</strong> policy always chooses the same action for a state. A <strong>stochastic</strong> policy outputs a probability distribution over actions. A stochastic policy is very useful for <strong>exploration</strong> (trying new things) and for games where unpredictability is an advantage (like poker).</p>
                <p><strong>3.</strong> Yes, but it's harder. Some algorithms, called "policy-gradient" methods, can directly search for a good policy without learning a value function. However, many of the most successful modern algorithms learn both, using the value function to help guide improvements to the policy.</p>
            </div>
        </div>

    </div>

</body>
</html>
{% endblock %}