| {% extends "layout.html" %}
|
|
|
| {% block content %}
|
| <!DOCTYPE html>
|
| <html lang="en">
|
| <head>
|
| <meta charset="UTF-8">
|
| <meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| <title>Study Guide: RL Action & Policy</title>
|
|
|
|
| <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
|
| <style>
|
|
|
| body {
|
| background-color: #ffffff;
|
| color: #000000;
|
| font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
|
| font-weight: normal;
|
| line-height: 1.8;
|
| margin: 0;
|
| padding: 20px;
|
| }
|
|
|
|
|
| .container {
|
| max-width: 800px;
|
| margin: 0 auto;
|
| padding: 20px;
|
| }
|
|
|
|
|
| h1, h2, h3 {
|
| color: #000000;
|
| border: none;
|
| font-weight: bold;
|
| }
|
|
|
| h1 {
|
| text-align: center;
|
| border-bottom: 3px solid #000;
|
| padding-bottom: 10px;
|
| margin-bottom: 30px;
|
| font-size: 2.5em;
|
| }
|
|
|
| h2 {
|
| font-size: 1.8em;
|
| margin-top: 40px;
|
| border-bottom: 1px solid #ddd;
|
| padding-bottom: 8px;
|
| }
|
|
|
| h3 {
|
| font-size: 1.3em;
|
| margin-top: 25px;
|
| }
|
|
|
|
|
| strong {
|
| font-weight: 900;
|
| }
|
|
|
|
|
| p, li {
|
| font-size: 1.1em;
|
| border-bottom: 1px solid #e0e0e0;
|
| padding-bottom: 10px;
|
| margin-bottom: 10px;
|
| }
|
|
|
|
|
| li:last-child {
|
| border-bottom: none;
|
| }
|
|
|
|
|
| ol {
|
| list-style-type: decimal;
|
| padding-left: 20px;
|
| }
|
|
|
| ol li {
|
| padding-left: 10px;
|
| }
|
|
|
|
|
| ul {
|
| list-style-type: none;
|
| padding-left: 0;
|
| }
|
|
|
| ul li::before {
|
| content: "•";
|
| color: #000;
|
| font-weight: bold;
|
| display: inline-block;
|
| width: 1em;
|
| margin-left: 0;
|
| }
|
|
|
|
|
| pre {
|
| background-color: #f4f4f4;
|
| border: 1px solid #ddd;
|
| border-radius: 5px;
|
| padding: 15px;
|
| white-space: pre-wrap;
|
| word-wrap: break-word;
|
| font-family: "Courier New", Courier, monospace;
|
| font-size: 0.95em;
|
| font-weight: normal;
|
| color: #333;
|
| border-bottom: none;
|
| }
|
|
|
|
|
| .story-rl {
|
| background-color: #fef2f2;
|
| border-left: 4px solid #dc3545;
|
| margin: 15px 0;
|
| padding: 10px 15px;
|
| font-style: italic;
|
| color: #555;
|
| font-weight: normal;
|
| border-bottom: none;
|
| }
|
|
|
| .story-rl p, .story-rl li {
|
| border-bottom: none;
|
| }
|
|
|
| .example-rl {
|
| background-color: #fef7f7;
|
| padding: 15px;
|
| margin: 15px 0;
|
| border-radius: 5px;
|
| border-left: 4px solid #f17c87;
|
| }
|
|
|
| .example-rl p, .example-rl li {
|
| border-bottom: none !important;
|
| }
|
|
|
|
|
| .quiz-section {
|
| background-color: #fafafa;
|
| border: 1px solid #ddd;
|
| border-radius: 5px;
|
| padding: 20px;
|
| margin-top: 30px;
|
| }
|
| .quiz-answers {
|
| background-color: #fef7f7;
|
| padding: 15px;
|
| margin-top: 15px;
|
| border-radius: 5px;
|
| }
|
|
|
|
|
| table {
|
| width: 100%;
|
| border-collapse: collapse;
|
| margin: 25px 0;
|
| }
|
| th, td {
|
| border: 1px solid #ddd;
|
| padding: 12px;
|
| text-align: left;
|
| }
|
| th {
|
| background-color: #f2f2f2;
|
| font-weight: bold;
|
| }
|
|
|
|
|
| @media (max-width: 768px) {
|
| body, .container {
|
| padding: 10px;
|
| }
|
| h1 { font-size: 2em; }
|
| h2 { font-size: 1.5em; }
|
| h3 { font-size: 1.2em; }
|
| p, li { font-size: 1em; }
|
| pre { font-size: 0.85em; }
|
| table, th, td { font-size: 0.9em; }
|
| }
|
| </style>
|
| </head>
|
| <body>
|
|
|
| <div class="container">
|
| <h1>🧠 Study Guide: Action & Policy in Reinforcement Learning</h1>
|
|
|
| <h2>🔹 1. Introduction</h2>
|
| <div class="story-rl">
|
| <p><strong>Story-style intuition: The Video Game Character</strong></p>
|
| <p>Think of a character in a video game. At any moment, the character has a set of possible moves they can make—jump, run, duck, attack. These are the character's <strong>Actions</strong>. The player controlling the character has a strategy in their head: "If a monster is close, I should attack. If there's a pit, I should jump." This strategy, this set of rules that dictates which action to take in any situation, is the <strong>Policy</strong>. In Reinforcement Learning, our goal is to teach the agent (the character) to learn the best possible policy on its own to win the game (maximize rewards).</p>
|
| </div>
|
| <p>In the world of RL, the <strong>Action</strong> is the "what" (what the agent does) and the <strong>Policy</strong> is the "how" (how the agent decides what to do). Together, they form the core of the agent's behavior.</p>
|
|
|
| <h2>🔹 2. Action (A)</h2>
|
| <p>An <strong>Action</strong> is one of the possible moves an agent can make in a given state. The set of all possible actions in a state is called the <strong>action space</strong>.</p>
|
| <h3>Types of Action Spaces:</h3>
|
| <ul>
|
<li><strong>Discrete Actions:</strong> The agent chooses from a finite set of distinct actions.
|
| <div class="example-rl"><p><strong>Example:</strong> In a maze, the actions are {Up, Down, Left, Right}. In a game of tic-tac-toe, the actions are placing your mark in one of the empty squares.</p></div>
|
| </li>
|
| <li><strong>Continuous Actions:</strong> The actions are described by real-valued numbers within a certain range.
|
| <div class="example-rl"><p><strong>Example:</strong> For a self-driving car, the action of steering can be any angle between -45.0 and +45.0 degrees. For a thermostat, the action is setting a temperature, which can be any value like 20.5°C.</p></div>
|
| </li>
|
| </ul>
|
| <p>The set of available actions can depend on the current state, denoted as \( A(s) \).</p>
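<p>As a minimal sketch, the two kinds of action space can be represented directly in code. Everything here (the action names, the steering bounds) is illustrative, not taken from any particular RL library:</p>

```python
import random

# Discrete action space: a finite set of distinct actions, as in the maze.
maze_actions = ["Up", "Down", "Left", "Right"]

def sample_discrete(actions):
    """Pick one of the finitely many actions uniformly at random."""
    return random.choice(actions)

# Continuous action space: any real value within a range, as in steering.
STEER_MIN, STEER_MAX = -45.0, 45.0  # degrees (illustrative bounds)

def sample_continuous(low, high):
    """Pick a real-valued action uniformly from [low, high]."""
    return random.uniform(low, high)

a_discrete = sample_discrete(maze_actions)          # e.g. "Left"
a_continuous = sample_continuous(STEER_MIN, STEER_MAX)
```

<p>Libraries such as Gymnasium formalize this same split with `Discrete` and `Box` space objects, but the idea is exactly what the two sampling functions show: choose from a finite set, or choose a real number in a range.</p>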
|
|
|
| <h2>🔹 3. Policy (π)</h2>
|
| <p>A <strong>Policy</strong> is the agent's strategy or "brain." It is a rule that maps a state to an action. The ultimate goal of RL is to find an <strong>optimal policy</strong>—a policy that maximizes the total expected reward over time.</p>
|
| <p>Mathematically, a policy is a distribution over actions given a state: \( \pi(a|s) = P(A_t = a \mid S_t = s) \)</p>
|
| <h3>Types of Policies:</h3>
|
| <ul>
|
| <li><strong>Deterministic Policy:</strong> The policy always outputs the same action for a given state. There is no randomness.
|
| <div class="example-rl"><p><strong>Story Example:</strong> A self-driving car's policy is deterministic: "If the traffic light state is 'Red', the action is always 'Brake'." There is no chance it will do something else.</p></div>
|
| <p>Formula: \( a = \pi(s) \)</p>
|
| </li>
|
| <li><strong>Stochastic Policy:</strong> The policy outputs a probability distribution over actions for a given state. The agent then samples from this distribution to choose its next action.
|
| <div class="example-rl"><p><strong>Story Example:</strong> A poker-playing bot might have a stochastic policy. In a certain state, its policy might be: "70% chance of 'Raising', 30% chance of 'Folding'." This randomness makes the agent's behavior less predictable to opponents and is crucial for exploration.</p></div>
|
| <p>Formula: \( a \sim \pi(\cdot|s) \)</p>
|
| </li>
|
| </ul>
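<p>The two formulas above map to two different kinds of function. Here is a minimal sketch (states, actions, and probabilities are illustrative, echoing the traffic-light and poker examples):</p>

```python
import random

# Deterministic policy: a fixed state -> action mapping, a = pi(s).
deterministic_policy = {"red_light": "Brake", "green_light": "Accelerate"}

def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions, a ~ pi(.|s).
stochastic_policy = {"strong_hand": {"Raise": 0.7, "Fold": 0.3}}

def act_stochastic(state):
    """Samples an action from the state's distribution over actions."""
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```

<p>Calling <code>act_deterministic("red_light")</code> always yields "Brake", while repeated calls to <code>act_stochastic("strong_hand")</code> yield "Raise" roughly 70% of the time.</p>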
|
|
|
| <h2>🔹 4. Policy vs. Value Function</h2>
|
| <p>It's crucial to distinguish between a policy and a value function, as they work together to guide the agent.</p>
|
|
|
| <ul>
|
| <li><strong>Policy (The "How-To" Guide):</strong> The policy tells you <strong>what to do</strong> in a state.
|
| <div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The policy says: Turn Left."</p></div>
|
| </li>
|
| <li><strong>Value Function (The "Evaluation Map"):</strong> The value function tells you <strong>how good it is</strong> to be in a certain state or to take a certain action in a state.
|
| <div class="example-rl"><p><strong>Example:</strong> "You are at a crossroads. The value function tells you: The path to the left has a high value because it leads to treasure. The path to the right has a low value because it leads to a dragon."</p></div>
|
| </li>
|
| </ul>
|
| <p>Modern RL algorithms often learn both. They use the value function to evaluate how good their actions are, which in turn helps them improve their policy.</p>
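<p>The link between the two can be sketched in a few lines: given action values (a hypothetical "evaluation map" Q), a policy can be derived by simply acting greedily on those values. The numbers below are made up to match the crossroads story:</p>

```python
# Hypothetical action values: how good each action is at the crossroads.
Q = {("crossroads", "Left"): 10.0,    # leads to treasure
     ("crossroads", "Right"): -50.0}  # leads to a dragon

def greedy_policy(state, actions):
    """Derive a policy from the value function: take the highest-value action."""
    return max(actions, key=lambda a: Q[(state, a)])

choice = greedy_policy("crossroads", ["Left", "Right"])  # "Left"
```

<p>This is exactly the sense in which the value function guides the policy: the policy says "Turn Left" because the evaluation map says the left path is worth more.</p>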
|
|
|
| <h2>🔹 5. Interaction Flow with Action & Policy</h2>
|
| <p>The Action and Policy are at the heart of the agent's decision-making in the RL loop.</p>
|
| <ol>
|
| <li><strong>Agent observes state (s):</strong> "I am at a crossroad."</li>
|
| <li><strong>Agent follows its policy (π) to choose an action (a):</strong> "My policy tells me to go left."</li>
|
| <li><strong>Environment transitions and gives reward (r):</strong> The agent moves left, finds a gold coin (+10 reward), and arrives at a new state.</li>
|
| <li><strong>Agent improves its policy:</strong> The agent thinks, "That was a great outcome! My policy was right to tell me to go left from that crossroad. I should strengthen that rule."</li>
|
| </ol>
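<p>The four steps above can be sketched as a toy loop in which positive rewards strengthen the rule that produced them. The softmax policy, the learning rate, and the toy environment are all illustrative assumptions, not a specific algorithm from the literature:</p>

```python
import math
import random

# Preferences over actions at the crossroad; higher = chosen more often.
prefs = {"Left": 0.0, "Right": 0.0}
ALPHA = 0.5  # learning rate (illustrative)

def policy(prefs):
    """A simple stochastic policy: softmax over action preferences."""
    exps = {a: math.exp(p) for a, p in prefs.items()}
    total = sum(exps.values())
    actions, probs = zip(*((a, e / total) for a, e in exps.items()))
    return random.choices(actions, weights=probs, k=1)[0]

def environment(action):
    """Toy environment: going Left finds the gold coin (+10 reward)."""
    return 10.0 if action == "Left" else 0.0

for episode in range(50):
    action = policy(prefs)          # 1-2: observe state, follow policy
    reward = environment(action)    # 3: environment returns a reward
    prefs[action] += ALPHA * reward # 4: positive reward strengthens the rule
```

<p>After a few episodes the preference for "Left" grows, so the policy picks it more and more often: the agent has "strengthened the rule" exactly as described in step 4.</p>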
|
|
|
| <h2>🔹 6. Detailed Examples</h2>
|
| <div class="example-rl">
|
| <h3>Example 1: Chess</h3>
|
| <ul>
|
| <li><strong>Actions:</strong> The set of all legal moves for the current player's pieces (e.g., move pawn e2 to e4, move knight g1 to f3). The action space changes with every state.</li>
|
<li><strong>Policy:</strong> A very complex strategy. A simple policy might be a set of human-written rules: "If my king is in check, my first priority is to move out of check." An advanced policy (like AlphaZero's) is a deep neural network that takes the board state as input and outputs a probability for every possible move.</li>
|
| </ul>
|
| </div>
|
| <div class="example-rl">
|
| <h3>Example 2: Self-Driving Car</h3>
|
| <ul>
|
| <li><strong>Actions:</strong> A continuous action space, often represented as a vector: `[steering_angle, acceleration, braking]`. For example, `[-5.2, 0.8, 0.0]` means steer 5.2 degrees left, accelerate at 80%, and don't brake.</li>
|
| <li><strong>Policy:</strong> A highly sophisticated function that takes sensor data (camera, LiDAR) as input and outputs the continuous action vector. A simple part of the policy might be: "If the distance to the car in front is less than 10 meters and decreasing, the braking component of my action vector should be high."</li>
|
| </ul>
|
| </div>
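<p>For the car example, one practical detail is that a policy's raw output must be clamped into each component's legal range before it is sent to the actuators. A minimal sketch, with bounds that are purely illustrative:</p>

```python
# Illustrative bounds for each component of the continuous action vector.
BOUNDS = {"steering_angle": (-45.0, 45.0),  # degrees
          "acceleration":   (0.0, 1.0),     # fraction of full throttle
          "braking":        (0.0, 1.0)}     # fraction of full braking

def clip_action(action):
    """Clamp each component of a raw policy output into its legal range."""
    return {name: min(max(action[name], lo), hi)
            for name, (lo, hi) in BOUNDS.items()}

raw = {"steering_angle": -60.0, "acceleration": 0.8, "braking": -0.1}
safe = clip_action(raw)  # steering clamped to -45.0, braking to 0.0
```
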
|
|
|
| <h2>🔹 7. Challenges</h2>
|
| <ul>
|
| <li><strong>Huge Action Spaces:</strong>
|
| <div class="example-rl"><p><strong>Example:</strong> In a real-time strategy game like StarCraft, an action could be commanding any one of hundreds of units to do any one of a dozen things, leading to millions of possible actions at any moment.</p></div>
|
| </li>
|
| <li><strong>Designing Effective Policies (Exploration):</strong> How do you design a policy that not only exploits what it knows but also explores new actions to discover better strategies? This is the exploration-exploitation dilemma.</li>
|
| <li><strong>Learning Stable Policies:</strong> In complex, dynamic environments, the feedback from actions can be noisy and delayed, making it very difficult for the policy to learn stable and reliable behaviors.</li>
|
| </ul>
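<p>One classic answer to the exploration-exploitation dilemma is an epsilon-greedy policy: exploit the best-known action most of the time, but explore a random action a small fraction of the time. A minimal sketch with made-up Q-values:</p>

```python
import random

EPSILON = 0.1  # fraction of decisions spent exploring (illustrative)

# Current action-value estimates for one state (illustrative numbers).
Q = {"Left": 2.0, "Right": 0.5}

def epsilon_greedy(q_values, epsilon=EPSILON):
    """Explore with probability epsilon; otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: any action
    return max(q_values, key=q_values.get)     # exploit: best-known action

picks = [epsilon_greedy(Q) for _ in range(1000)]
```

<p>With epsilon at 0.1 the agent picks "Left" about 95% of the time, yet it never stops occasionally testing "Right" in case its estimate for that action is wrong.</p>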
|
|
|
| <div class="quiz-section">
|
| <h2>📝 Quick Quiz: Test Your Knowledge</h2>
|
| <ol>
|
| <li><strong>What is the difference between a discrete and a continuous action space? Give an example of each.</strong></li>
|
| <li><strong>What is the difference between a deterministic and a stochastic policy? When might a stochastic policy be useful?</strong></li>
|
| <li><strong>Can an agent have a good policy without knowing the value function?</strong></li>
|
| </ol>
|
| <div class="quiz-answers">
|
| <h3>Answers</h3>
|
| <p><strong>1.</strong> A <strong>discrete</strong> action space has a finite number of distinct options (e.g., move left/right). A <strong>continuous</strong> action space has actions represented by real numbers in a range (e.g., turning a steering wheel by 15.7 degrees).</p>
|
| <p><strong>2.</strong> A <strong>deterministic</strong> policy always chooses the same action for a state. A <strong>stochastic</strong> policy outputs a probability distribution over actions. A stochastic policy is very useful for <strong>exploration</strong> (trying new things) and for games where unpredictability is an advantage (like poker).</p>
|
| <p><strong>3.</strong> Yes, but it's harder. Some algorithms, called "policy-gradient" methods, can directly search for a good policy without learning a value function. However, many of the most successful modern algorithms learn both, using the value function to help guide improvements to the policy.</p>
|
| </div>
|
| </div>
|
|
|
| </div>
|
|
|
| </body>
|
| </html>
|
| {% endblock %}
|
|
|