{% extends "layout.html" %} {% block content %}
Story-style intuition: The Video Game Character
Think of a character in a video game. At any moment, the character has a set of possible moves they can make—jump, run, duck, attack. These are the character's Actions. The player controlling the character has a strategy in their head: "If a monster is close, I should attack. If there's a pit, I should jump." This strategy, this set of rules that dictates which action to take in any situation, is the Policy. In Reinforcement Learning, our goal is to teach the agent (the character) to learn the best possible policy on its own to win the game (maximize rewards).
In the world of RL, the Action is the "what" (what the agent does) and the Policy is the "how" (how the agent decides what to do). Together, they form the core of the agent's behavior.
An Action is one of the possible moves an agent can make in a given state. The set of all possible actions in a state is called the action space.
Example (discrete actions): In a maze, the actions are {Up, Down, Left, Right}. In a game of tic-tac-toe, the actions are placing your mark in one of the empty squares.
Example (continuous actions): For a self-driving car, the steering action can be any angle between -45.0 and +45.0 degrees. For a thermostat, the action is a temperature setpoint, which can be any value such as 20.5°C.
The set of available actions can depend on the current state, denoted as \( A(s) \).
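The ideas above can be sketched in a few lines of Python. This is a minimal illustration, not a real environment: the maze state, the wall encoding, and the helper names (`available_actions`, `random_steering_angle`) are all hypothetical.

```python
import random

# Discrete action space for a hypothetical maze agent.
ACTIONS = ["Up", "Down", "Left", "Right"]

def available_actions(state):
    """Return A(s): the actions legal in this state (blocked moves removed)."""
    walls = state.get("walls", set())
    return [a for a in ACTIONS if a not in walls]

# Continuous action space: a steering angle in [-45.0, +45.0] degrees.
def random_steering_angle():
    return random.uniform(-45.0, 45.0)

s = {"walls": {"Up"}}          # a cell with a wall above it
print(available_actions(s))    # ['Down', 'Left', 'Right']
```

Note how `available_actions` depends on the state, mirroring the notation \( A(s) \): the same agent has different legal moves in different cells.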
A Policy is the agent's strategy or "brain." It is a rule that maps a state to an action. The ultimate goal of RL is to find an optimal policy—a policy that maximizes the total expected reward over time.
Mathematically, a policy is a distribution over actions given a state: \( \pi(a|s) = P(A_t = a \mid S_t = s) \)
Story Example: A self-driving car might use a deterministic policy: "If the traffic light state is 'Red', the action is always 'Brake'." Given the same state, the agent always chooses the same action.
Formula: \( a = \pi(s) \)
Story Example: A poker-playing bot might have a stochastic policy. In a certain state, its policy might be: "70% chance of 'Raising', 30% chance of 'Folding'." This randomness makes the agent's behavior less predictable to opponents and is crucial for exploration.
Formula: \( a \sim \pi(\cdot|s) \)
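Both kinds of policy can be sketched directly in Python. The states, actions, and probabilities below are made up for illustration: a deterministic policy is just a function from state to action, while a stochastic policy stores a distribution per state and samples from it.

```python
import random

# Deterministic policy: a = pi(s). One fixed action per state.
def deterministic_policy(state):
    if state == "red_light":
        return "Brake"
    return "Drive"

# Stochastic policy: a ~ pi(.|s). A probability distribution per state.
poker_policy = {
    "strong_hand_facing_bet": {"Raise": 0.7, "Fold": 0.3},
}

def sample_action(policy, state):
    """Draw one action according to the policy's distribution for this state."""
    dist = policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy("red_light"))                      # Brake
print(sample_action(poker_policy, "strong_hand_facing_bet"))  # Raise or Fold
```

Running `sample_action` many times would return "Raise" roughly 70% of the time, matching \( \pi(\text{Raise} \mid s) = 0.7 \).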
It's crucial to distinguish between a policy and a value function, as they work together to guide the agent.
Example: "You are at a crossroads. The policy says: Turn Left."
Example: "You are at a crossroads. The value function tells you: The path to the left has a high value because it leads to treasure. The path to the right has a low value because it leads to a dragon."
Modern RL algorithms often learn both. They use the value function to evaluate how good their actions are, which in turn helps them improve their policy.
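The crossroads story can be made concrete with a tiny sketch: a value function that scores successor states, and a greedy policy derived from it. Everything here (the state names, values, and transition table) is invented for the example.

```python
# Hypothetical value function: how good is each state to be in?
V = {"treasure_path": 10.0, "dragon_path": -10.0}

# Which state does each (state, action) pair lead to?
transitions = {
    ("crossroads", "Left"): "treasure_path",
    ("crossroads", "Right"): "dragon_path",
}

def greedy_policy(state, actions):
    """Pick the action whose successor state has the highest value."""
    return max(actions, key=lambda a: V[transitions[(state, a)]])

print(greedy_policy("crossroads", ["Left", "Right"]))  # Left
```

This is the simplest form of "value function guiding the policy": the policy does not store "Turn Left" explicitly; it recomputes the best action from the values each time.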
The Action and Policy are at the heart of the agent's decision-making in the RL loop.
Example: In a real-time strategy game like StarCraft, an action could be commanding any one of hundreds of units to do any one of a dozen things, leading to millions of possible actions at any moment.
1. What is the difference between a discrete and a continuous action space? A discrete action space has a finite number of distinct options (e.g., move left/right). A continuous action space has actions represented by real numbers in a range (e.g., turning a steering wheel by 15.7 degrees).
2. What is the difference between a deterministic and a stochastic policy? A deterministic policy always chooses the same action for a given state. A stochastic policy outputs a probability distribution over actions. Stochastic policies are useful for exploration (trying new things) and for games where unpredictability is an advantage (like poker).
3. Can an agent learn a policy without a value function? Yes, but it's harder. Some algorithms, called "policy-gradient" methods, can directly search for a good policy without learning a value function. However, many of the most successful modern algorithms learn both, using the value function to help guide improvements to the policy.
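To see what "directly searching for a good policy" means, here is a minimal policy-gradient-style sketch on a two-armed bandit. It is illustrative only: the reward values and learning rate are assumptions, and a real policy-gradient algorithm (e.g., REINFORCE on a full MDP) would also handle multi-step episodes.

```python
import math
import random

# Two-armed bandit: the policy is a softmax over two action preferences.
# No value function is learned -- the preferences are adjusted directly
# from sampled rewards.
prefs = [0.0, 0.0]            # one preference (parameter) per arm
true_rewards = [0.2, 1.0]     # arm 1 pays more (assumed for the demo)
alpha = 0.1                   # learning rate

def softmax(p):
    e = [math.exp(x) for x in p]
    total = sum(e)
    return [x / total for x in e]

random.seed(0)
for _ in range(2000):
    probs = softmax(prefs)
    a = random.choices([0, 1], weights=probs)[0]
    r = true_rewards[a]
    # Gradient of log pi(a) w.r.t. each preference: (1[i == a] - probs[i]).
    # Scale by the reward so well-rewarded actions become more likely.
    for i in range(2):
        grad = (1.0 - probs[i]) if i == a else -probs[i]
        prefs[i] += alpha * r * grad

print(round(softmax(prefs)[1], 2))  # probability of the better arm after training
```

After training, the policy should assign most of its probability to arm 1, having improved its behavior without ever estimating "how good" each arm is as a separate value estimate.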