import ResponsiveImage from '../../components/ResponsiveImage.astro'
import Ch3LearningBenefits from '../assets/image/ch3/ch3-learning-benefits.png'
import Ch3LearningAtlas from '../assets/image/ch3/ch3-learning-atlas.png'
import Ch3RlExamples from '../assets/image/ch3/ch3-rl-examples.png'
import Ch3AgentEnv from '../assets/image/ch3/ch3-agent-env.png'
import Ch3RlAlgorithmsAtlas from '../assets/image/ch3/ch3-rl-algorithms-atlas.png'
import Ch3DuckSimVsReal from '../assets/image/ch3/ch3-duck-sim-vs-real.png'
import Ch3ManyDucks from '../assets/image/ch3/ch3-many-ducks.png'
import Ch3HilSerlExamples from '../assets/image/ch3/ch3-hil-serl-examples.png'


# Robot (Reinforcement) Learning

::: epigraph
*Approximate the solution, not the problem* [\...]

Richard Sutton
:::

> **TL;DR**
> The need for expensive high-fidelity simulators can be obviated by learning from real-world data, using sample-efficient algorithms that can safely train directly on hardware.

<ResponsiveImage src={Ch3LearningBenefits} alt="Learning-based robotics streamlines perception-to-action by learning a (1) unified high-level controller capable of taking in (2) high-dimensional, unstructured sensorimotor information. Learning (3) does not require a dynamics model and instead focuses on interaction data, and (4) its performance empirically correlates with the scale of the data used." id="fig-fig:robot-learning-upsides" />

*Learning-based robotics streamlines perception-to-action by learning a (1) unified high-level controller capable of taking in (2) high-dimensional, unstructured sensorimotor information. Learning (3) does not require a dynamics model and instead focuses on interaction data, and (4) its performance empirically correlates with the scale of the data used.*

Learning-based techniques for robotics naturally address the limitations presented in the following section (see the figure above).
Learning-based techniques typically rely on prediction-to-action pipelines (*visuomotor policies*) that directly map sensorimotor inputs to predicted actions, streamlining control by removing the need to interface multiple components.
Mapping sensorimotor inputs to actions directly also makes it possible to add diverse input modalities, leveraging the automatic feature extraction characteristic of most modern learning systems.
Further, learning-based approaches can in principle bypass modeling efforts entirely and instead rely exclusively on interaction data, proving transformative when dynamics are challenging to model or even entirely unknown.
Lastly, learning for robotics (*robot learning*) is naturally well positioned to leverage the growing amount of openly available robotics data, just as computer vision first and natural language processing later benefited from large-scale corpora of (possibly non-curated) data, a resource largely overlooked by dynamics-based approaches.

As the field is still at a relatively nascent stage, no single technique has proved distinctly better than the others in robot learning.
Still, two major classes of methods have gained prominence: Reinforcement Learning (RL) and Behavioral Cloning (BC) (see the figure below).
In this section, we provide a conceptual overview of applications of the former to robotics, and introduce practical examples of how to use RL within **LeRobot**.
We then discuss the major limitations RL suffers from, which motivate the BC techniques introduced in the following sections.

<ResponsiveImage src={Ch3LearningAtlas} alt="Overview of the robot learning methods implemented in **LeRobot**." id="fig-fig:robot-learning-atlas" />

*Overview of the robot learning methods implemented in **LeRobot**.*

In the figure above, we decided to include generalist robot models [@black $p_0$ VisionLanguageActionFlow2024,shukorSmolVLAVisionLanguageActionModel2025] alongside task-specific BC methods.
While significantly different in spirit---*generalist* models are language-conditioned and use instructions to generate motion valid across many tasks, while *task-specific* models are typically not language-conditioned and are used to perform a single task---foundation models are largely trained to reproduce trajectories contained in a large training set of input demonstrations.
Thus, we argue generalist policies can indeed be grouped alongside other task-specific BC methods, as they both leverage similar training data and schemas.

The figure above illustrates this categorization graphically, explicitly listing all the robot learning policies currently available in **LeRobot**: Action Chunking with Transformers (ACT) [@zhaoLearningFineGrainedBimanual2023], Diffusion Policy [@chiDiffusionPolicyVisuomotor2024], Vector-Quantized Behavior Transformer (VQ-BeT) [@leeBehaviorGenerationLatent2024], $\pi_0$ [@black $p_0$ VisionLanguageActionFlow2024], SmolVLA [@shukorSmolVLAVisionLanguageActionModel2025], Human-in-the-loop Sample-efficient RL (HIL-SERL) [@luoPreciseDexterousRobotic2024] and TD-MPC [@hansenTemporalDifferenceLearning2022].

<ResponsiveImage src={Ch3RlExamples} alt="Examples of two different robotics tasks performed using RL. In the manipulation task (A) an agent learns to reach for a yellow plastic block in its environment, and to put it inside of a box. In the locomotion task (B) an agent learns to move its center of mass sideways without falling." id="fig-fig:robotics-with-rl-examples" />

*Examples of two different robotics tasks performed using RL. In the manipulation task (A) an agent learns to reach for a yellow plastic block in its environment, and to put it inside of a box. In the locomotion task (B) an agent learns to move its center of mass sideways without falling.*

Applications of RL to robotics have long been studied, to the point that the relationship between these two disciplines has been compared to that between physics and mathematics [@koberReinforcementLearningRobotics].
Indeed, due to their interactive and sequential nature, many robotics problems can be directly mapped to RL problems.
The figure above depicts two such cases.
Reaching for an object to move it somewhere else in the scene is indeed a sequential problem, where at each cycle the controller needs to adjust the position of the robotic arm based on its current configuration and the (possibly varying) position of the object.
The figure above also shows an example of a locomotion problem, where sequentiality is inherent in the problem formulation.
While sliding to the side, the controller has to keep adjusting to the robot's proprioception to avoid failure (falling).

## A (Concise) Introduction to RL

The RL framework [@suttonReinforcementLearningIntroduction2018], which we briefly introduce here, has often been used to model robotics problems [@koberReinforcementLearningRobotics].
RL is a subfield within ML fundamentally concerned with the development of autonomous systems (*agents*) learning how to *continuously behave* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*).
Crucially for robotics, RL agents can improve via trial-and-error only, thus entirely bypassing the need to develop explicit models of the problem dynamics, and rather exploiting interaction data only.
In RL, this feedback loop between actions and outcomes (see the figure below) is established through the agent sensing a scalar quantity (*reward*).

<ResponsiveImage src={Ch3AgentEnv} alt="Agent-Environment interaction diagram (image credits to [@suttonReinforcementLearningIntroduction2018])." id="fig-fig:rl-most-famous-pic" />

*Agent-Environment interaction diagram (image credits to [@suttonReinforcementLearningIntroduction2018]).*

Formally, interactions between an agent and its environment are typically modeled via a Markov Decision Process (MDP) [@bellmanMarkovianDecisionProcess1957].
Representing robotics problems via MDPs offers several advantages, including (1) incorporating uncertainty through the MDP's inherently stochastic formulation and (2) providing a theoretically sound framework for learning *without* an explicit dynamics model.
While MDPs also accommodate a continuous-time formulation, they are typically considered in discrete time in RL, thus assuming interactions to take place atomically at discrete *timesteps* $t=0,1,2,3, \dots, T$.
MDPs allowing for an unbounded number of interactions ($T \to +\infty$) are typically termed *infinite-horizon*, as opposed to *finite-horizon* MDPs in which $T$ is finite.
Unless otherwise specified, we will only be referring to discrete-time finite-horizon (*episodic*) MDPs here.

Formally, a length-$T$ Markov Decision Process (MDP) is a tuple $\mathcal M = \langle \mathcal{S}, \mathcal{A}, \mathcal{D}, r, \gamma, \rho, T \rangle$, where:

- $\mathcal{S}$ is the *state space*; $s_t \in \mathcal{S}$ denotes the (possibly not directly observable) environment state at time $t$. In robotics, states often comprise robot configuration and velocities ($q_t, \dot q_t$), and can accommodate sensor readings such as camera or audio streams.

- $\mathcal{A}$ is the *action space*; $a_t \in \mathcal{A}$ may represent joint torques, joint velocities, or even end-effector commands. In general, actions correspond to commands intervening on the configuration of the robot.

- $\mathcal{D}$ represents the (possibly non-deterministic) environment dynamics, with $\mathcal{D}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$ corresponding to $\mathcal{D}(s_t, a_t, s_{t+1}) = \mathbb{P}(s_{t+1} \vert s_t, a_t)$. For instance, for a planar manipulator, dynamics could be considered deterministic when the environment is fully described (figure `fig:planar-manipulation-simple`), and stochastic when unmodeled disturbances depending on non-observable parameters intervene (figure `fig:planar-manipulator-box-velocity`).

- $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb R$ is the *reward function*, weighing the transition $(s_t, a_t, s_{t+1})$ in the context of the achievement of an arbitrary goal. For instance, a simple reward function for quickly moving along the $x$ axis in 3D space (see panel B of the RL examples figure above) could be based on the absolute position of the robot along the $x$ axis ($p_x$), penalize falling over (measured from $p_z$), and introduce a bonus $\dot p_x$ for speed, $r(s_t, a_t, s_{t+1}) \equiv r(s_t) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}}$.

Lastly, $\gamma \in [0,1]$ represents the discount factor regulating preference for immediate versus long-term reward (with an effective horizon equal to $\tfrac{1}{1-\gamma}$), and $\rho$ is the distribution, defined over $\mathcal{S}$, from which the MDP's *initial* state is sampled, $s_0 \sim \rho$.
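As a minimal sketch, the example locomotion reward $r(s_t) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}}$ can be written as a plain Python function. The function name, the argument names and the `eps` guard against division by zero are illustrative assumptions, not part of **LeRobot**.

```python
def locomotion_reward(p_x: float, p_x_dot: float, p_z: float, eps: float = 1e-6) -> float:
    """Example reward r(s_t) = p_x * p_x_dot - 1 / p_z.

    p_x: position along the x axis, p_x_dot: velocity along x,
    p_z: height of the robot (small when the robot has fallen over).
    eps is a hypothetical guard to avoid dividing by zero.
    """
    return p_x * p_x_dot - 1.0 / max(p_z, eps)


# Same progress and speed, different heights: falling is heavily penalized.
upright = locomotion_reward(p_x=1.0, p_x_dot=0.5, p_z=0.5)
fallen = locomotion_reward(p_x=1.0, p_x_dot=0.5, p_z=0.05)
```

With these hypothetical values, the fallen configuration receives a much lower reward than the upright one, because the $\tfrac{1}{p_z}$ term grows as the height shrinks.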

A length-$T$ *trajectory* is the (random) sequence
$$\begin{equation}
    \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T),
\end{equation}$$
with per-step rewards defined as $r_t = r(s_t, a_t, s_{t+1})$ for ease of notation. Interestingly, assuming both the environment dynamics and the conditional distribution over actions given states---the *policy*---to be *Markovian*,

$$\begin{aligned}
\mathbb P(s_{t+1} \vert s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) &= \mathbb P(s_{t+1} \vert s_t, a_t), \\
\mathbb P(a_t \vert s_t, a_{t-1}, s_{t-1}, \dots, s_0, a_0) &= \mathbb P(a_t \vert s_t),
\end{aligned}$$

the probability of observing a given trajectory $\tau$ factorizes into
$$\begin{equation}
    \mathbb P(\tau) = \mathbb P(s_0) \prod_{t=0}^{T-1} \mathbb P(s_{t+1} \vert s_t, a_t) \ \mathbb P(a_t \vert s_t).
\end{equation}$$

Policies $\mathbb P(a_t \vert s_t)$ are typically indicated as $\pi(a_t \vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t \vert s_t )$.
Policies are trained by optimizing the (discounted) *return* associated with a given $\tau$, i.e., the (random) sum of discounted rewards collected along the trajectory:
$$G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t.$$
That is, agents seek to learn control strategies (*policies*, $\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$.
For a given dynamics $\mathcal D$---i.e., for a given problem---taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$,

$$\begin{aligned}
J(\pi_\theta) &= \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} \left[ G(\tau) \right], \\
\mathbb P_{\theta; \mathcal D}(\tau) &= \rho(s_0) \prod_{t=0}^{T-1} \mathcal D(s_t, a_t, s_{t+1}) \ \pi_\theta(a_t \vert s_t).
\end{aligned}$$
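The return $G(\tau)$ and a Monte-Carlo estimate of $J(\pi_\theta)$ can be sketched in a few lines of Python. The toy problem below is a hypothetical assumption made only for illustration: the policy is summarized by a single probability of stepping right, and each step to the right yields a reward of 1.

```python
import random


def discounted_return(rewards, gamma):
    """G(tau) = sum_t gamma^t * r_t for a finite list of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rollout(p_right, T, rng):
    """Hypothetical toy episode: reward 1 when stepping right, 0 otherwise."""
    return [1.0 if rng.random() < p_right else 0.0 for _ in range(T)]


def estimate_J(p_right, gamma=0.9, T=20, n_episodes=5000, seed=0):
    """Monte-Carlo estimate of J: average return over sampled trajectories."""
    rng = random.Random(seed)
    returns = [
        discounted_return(rollout(p_right, T, rng), gamma)
        for _ in range(n_episodes)
    ]
    return sum(returns) / len(returns)


J_good = estimate_J(p_right=0.8)
J_bad = estimate_J(p_right=0.2)
```

Comparing `J_good` and `J_bad` illustrates the ordering over policies induced by $J$: under this toy reward, the policy that steps right more often attains the higher expected return.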

Because in the RL framework the agent is assumed to only be able to observe the environment dynamics and not to intervene on them, $J(\pi_\theta)$ varies exclusively with the policy followed.
In turn, MDPs naturally provide a framework to optimize over the space of the possible behaviors an agent might enact ($\pi \in \Pi$), searching for the *optimal policy* $\pi^* = \arg \max_{\theta} J(\pi_\theta)$, where $\theta$ is the parametrization adopted by the policy set $\Pi: \pi_\theta \in \Pi, \ \forall \theta$.
Other than providing a target for policy search, $G(\tau)$ can also be used as a target to discriminate between states and state-action pairs.
Given any state $s \in \mathcal{S}$---e.g., a given configuration of the robot---the *state-value* function
$$V_\pi(s) = \mathbb E_{\tau \sim \pi} \left[ G(\tau) \big\vert s_0 = s \right]$$
can be used to discriminate between desirable and undesirable states in terms of long-term (discounted) reward maximization, under a given policy $\pi$.
Similarly, the *state-action* value function also conditions the cumulative discounted reward on selecting action $a$ when in $s$, and thereafter acting according to $\pi$:
$$Q_\pi(s,a) = \mathbb E_{\tau \sim \pi} \left[ G(\tau) \big\vert s_0 = s, a_0 = a \right].$$
Crucially, value functions are interrelated:

$$\begin{aligned}
Q_\pi(s_t, a_t) &= \mathbb{E}_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma V_\pi(s_{t+1}) \right], \\
V_\pi(s_t) &= \mathbb E_{a_t \sim \pi(\bullet \vert s_t)} \left[ Q_\pi(s_t, a_t) \right].
\end{aligned}$$
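The two relations between $Q_\pi$ and $V_\pi$ can be checked numerically on a small tabular problem. The sketch below makes several assumptions for simplicity: a hypothetical 2-state, 2-action MDP with known dynamics, a fixed stochastic policy, and an infinite-horizon discounted setting (rather than the episodic one used in the text) so that $V_\pi$ is a fixed point of repeated backups.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
pi = {0: [0.5, 0.5], 1: [0.1, 0.9]}  # pi[s][a]: probability of action a in state s
gamma = 0.9


def q_from_v(V, s, a):
    """Q(s, a) = E_{s'}[ r + gamma * V(s') ]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate_policy(n_sweeps=500):
    """Iterate V(s) <- E_{a ~ pi}[ Q(s, a) ] until (numerical) convergence."""
    V = {s: 0.0 for s in P}
    for _ in range(n_sweeps):
        V = {s: sum(pi[s][a] * q_from_v(V, s, a) for a in P[s]) for s in P}
    return V


V = evaluate_policy()
Q = {(s, a): q_from_v(V, s, a) for s in P for a in P[s]}
```

At convergence, averaging `Q[(s, a)]` under `pi[s]` recovers `V[s]`, which is the second relation above, while `q_from_v` implements the first.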

Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms.
A variety of methods have been developed in RL as standalone attempts to find (approximate) solutions to the problem of maximizing cumulative reward (see the figure below).

<ResponsiveImage src={Ch3RlAlgorithmsAtlas} alt="Popular RL algorithms. See [@SpinningUp2018] for a complete list of citations." id="fig-fig:rl-algos-atlas" />

*Popular RL algorithms. See [@SpinningUp2018] for a complete list of citations.*

Popular approaches to continuous state and action spaces---such as those studied within robotics---include TRPO [@schulmanTrustRegionPolicy2017], PPO [@schulmanProximalPolicyOptimization2017] and SAC [@haarnojaSoftActorCriticOffPolicy2018].
Across manipulation [@akkayaSolvingRubiksCube2019] and locomotion [@leeLearningQuadrupedalLocomotion2020] problems, RL proved extremely effective in providing a platform to (1) adopt a unified, streamlined perception-to-action pipeline, (2) natively integrate proprioception with multi-modal high-dimensional sensor streams, (3) disregard a description of the environment dynamics, by focusing on observed interaction data rather than modeling, and (4) anchor policies in the experience collected and stored in datasets.
For a more complete survey of applications of RL to robotics, we refer the reader to @koberReinforcementLearningRobotics and @tangDeepReinforcementLearning2024.

## Real-world RL for Robotics

Streamlined end-to-end control pipelines, data-driven feature extraction and a disregard for explicit modeling in favor of interaction data are all features of RL for robotics.
However, particularly in the context of real-world robotics, RL still suffers from limitations concerning machine safety and learning efficiency.

First, especially early in training, actions are largely exploratory and can be erratic.
On physical systems, untrained policies may command high velocities, self-colliding configurations, or torques exceeding joint limits, leading to wear and potential hardware damage.
Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often at the cost of a high degree of human supervision.
Further, in the typical episodic setting considered in most robotics problems, experimentation is substantially slowed down by the need to manually reset the environment over the course of training, a time-consuming and brittle process.

Second, learning with a limited number of samples remains problematic in RL.
Even strong algorithms such as SAC [@haarnojaSoftActorCriticOffPolicy2018] typically require a large number of transitions $\{ (s_t, a_t, r_t, s_{t+1}) \}_{t=1}^N$.
On hardware, generating these data is time-consuming and can even be prohibitive.

<ResponsiveImage src={Ch3DuckSimVsReal} alt="Simulated (left) vs. real-world (right) OpenDuck. Discrepancies in the simulation dynamics (*reality gap*) pose risks to policy transfer." id="fig-fig:synthetic-vs-real-duck" />

*Simulated (left) vs. real-world (right) OpenDuck. Discrepancies in the simulation dynamics (*reality gap*) pose risks to policy transfer.*

Training RL policies in simulation [@tobinDomainRandomizationTransferring2017] addresses both issues: it eliminates physical risk and dramatically increases throughput.
Yet, simulators require significant modeling effort, and rely on assumptions (simplified physical modeling, instantaneous actuation, static environmental conditions, etc.) that limit the transfer of policies learned in simulation, due to the discrepancy between real and simulated environments (*reality gap*, see the figure above).
*Domain randomization* (DR) is a popular technique to overcome the reality gap, consisting of randomizing parameters of the simulated environment during training to induce robustness to specific disturbances.
In turn, DR is employed to increase the diversity of scenarios over the course of training, improving the chances of successful sim-to-real transfer [@akkayaSolvingRubiksCube2019,antonovaReinforcementLearningPivoting2017,jiDribbleBotDynamicLegged2023].
In practice, DR is performed by further parametrizing the *simulator*'s dynamics $\mathcal D \equiv \mathcal D_\xi$ with a (random) *dynamics* vector $\xi$ drawn from an arbitrary distribution, $\xi \sim \Xi$.
Over the course of training---typically at each episode's reset---a new $\xi$ is drawn, and used to specify the environment's dynamics for that episode.
For instance, one could decide to randomize the friction coefficient of the surface in a locomotion task (see the figure below), or the center of mass of an object for a manipulation task.
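The per-episode resampling of $\xi$ can be sketched as follows. Everything here is a hypothetical stand-in: `ToySimulator` is not a real simulator API, $\xi$ is a single friction coefficient, and $\Xi$ is assumed to be a uniform distribution over an arbitrary range.

```python
import random


class ToySimulator:
    """Hypothetical simulator whose dynamics D_xi depend on a friction coefficient xi."""

    def reset(self, xi):
        self.friction = xi      # dynamics parameters are fixed for the whole episode
        self.position = 0.0
        return self.position

    def step(self, push):
        # Higher friction means less displacement for the same push.
        self.position += push * (1.0 - self.friction)
        return self.position


def sample_xi(rng, low=0.2, high=0.8):
    """Draw xi ~ Xi, here assumed to be Uniform(low, high)."""
    return rng.uniform(low, high)


def run_randomized_episodes(n_episodes=5, horizon=10, seed=0):
    rng = random.Random(seed)
    sim = ToySimulator()
    frictions, final_positions = [], []
    for _ in range(n_episodes):
        xi = sample_xi(rng)     # a new xi is drawn at each episode reset
        sim.reset(xi)
        for _ in range(horizon):
            sim.step(push=1.0)  # a trained policy would choose the action here
        frictions.append(xi)
        final_positions.append(sim.position)
    return frictions, final_positions


frictions, final_positions = run_randomized_episodes()
```

The same sequence of actions leads to different outcomes across episodes, which is the diversity a policy trained under DR has to cope with.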

<ResponsiveImage src={Ch3ManyDucks} alt="The same locomotion task can be carried out in different (simulated) domains (exemplified by the difference in terrains) at training time, resulting in increased robustness over diverse environment dynamics." id="fig-fig:ducks-on-terrains" />

*The same locomotion task can be carried out in different (simulated) domains (exemplified by the difference in terrains) at training time, resulting in increased robustness over diverse environment dynamics.*

While effective in transferring policies across the reality gap in real-world robotics [@tobinDomainRandomizationTransferring2017,akkayaSolvingRubiksCube2019, jiDribbleBotDynamicLegged2023,tiboniDomainRandomizationEntropy2024], DR often requires extensive manual engineering.
First, identifying which parameters to randomize---i.e., the *support* $\text{supp}(\Xi)$ of $\Xi$---is an inherently task-specific process.
When locomoting over different terrains, choosing to randomize the friction coefficient is a reasonable choice, yet not a complete solution, as other factors (lighting conditions, external temperature, joints' fatigue, etc.) may prove just as important, making the selection of these parameters yet another source of brittleness.

Selecting the dynamics distribution $\Xi$ is also non-trivial.
On the one hand, distributions with low entropy risk causing failure at transfer time, due to the limited robustness induced over the course of training.
On the other hand, excessive randomization may cause over-regularization and hinder performance.
Consequently, the research community investigated approaches to automatically select the randomization distribution $\Xi$, using signals from the training process or tuning it to reproduce observed real-world trajectories.
@akkayaSolvingRubiksCube2019 use a parametric uniform distribution $\mathcal U(a, b)$ as $\Xi$, widening the bounds as training progresses and the agent's performance improves (AutoDR).
While effective, AutoDR requires significant tuning---the bounds are widened by a fixed, pre-specified amount $\Delta$---and may disregard data when performance *does not* improve after a distribution update [@tiboniDomainRandomizationEntropy2024].
@tiboniDomainRandomizationEntropy2024 propose a method similar to AutoDR (DORAEMON) to evolve $\Xi$ based on the training signal, but with the key difference of explicitly maximizing the entropy of parametric Beta distributions, which are inherently more flexible than uniform distributions.
DORAEMON proves particularly effective at dynamically increasing the entropy of the training distribution by employing a max-entropy objective under performance constraints.
Other approaches to automatic DR consist of tuning $\Xi$ specifically to align the simulation and real-world domains as much as possible.
For instance, @chebotar2019closing interleave in-simulation policy training with repeated real-world policy rollouts used to adjust $\Xi$ based on real-world data, while @tiboniDROPOSimtoRealTransfer2023 leverage a single, pre-collected set of real-world trajectories and tune $\Xi$ under a simple likelihood objective.
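The idea of widening the bounds of a uniform $\Xi = \mathcal U(a, b)$ by a fixed $\Delta$ whenever performance is good enough can be sketched as below. This is a deliberately simplified illustration of the mechanism described above and not the AutoDR algorithm itself; the function name, the performance threshold and the hard limits are all assumptions.

```python
def widen_bounds(low, high, performance, threshold, delta, hard_low=0.0, hard_high=1.0):
    """Widen Uniform(low, high) by a fixed delta when performance clears a threshold.

    hard_low and hard_high are hypothetical physical limits for the parameter.
    """
    if performance >= threshold:
        low = max(hard_low, low - delta)
        high = min(hard_high, high + delta)
    return low, high


# Hypothetical evaluation scores collected at three successive checkpoints.
low, high = 0.45, 0.55
for performance in [0.9, 0.5, 0.95]:
    low, high = widen_bounds(low, high, performance, threshold=0.8, delta=0.05)
```

In this toy run, the bounds widen after the first and third checkpoints and stay unchanged after the second, where the score falls below the threshold.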

While DR has shown promise, it does not address the main limitation that, even assuming an ideal distribution $\Xi$ to sample from were indeed available, many robotics problems are hard to simulate with sufficient fidelity in the first place.
Simulating contact-rich manipulation of possibly deformable or soft materials---e.g., *folding a piece of clothing*---can be costly and time-intensive, limiting the benefits of in-simulation training.

A perhaps more fundamental limitation of RL for robotics is the general unavailability of *dense* reward functions for complicated tasks, the design of which essentially relies on human expertise and trial-and-error.
In practice, *sparse* reward functions can be used to determine whether one specific goal has been attained---*has this t-shirt been correctly folded?*---but unfortunately make learning more challenging.
As a result, despite notable successes, deploying RL directly on real-world robots at scale remains challenging.

To make the most of (1) the growing number of openly available datasets and (2) relatively inexpensive robots like the SO-100, RL could (1) be anchored in already-collected trajectories---limiting erratic and dangerous exploration---and (2) train in the real world directly---bypassing the aforementioned issues with low-fidelity simulations.
In such a context, sample-efficient learning is also paramount, as training in the real world is inherently time-bottlenecked.

Off-policy algorithms like Soft Actor-Critic (SAC) [@haarnojaSoftActorCriticOffPolicy2018] tend to be more sample efficient than their on-policy counterparts [@schulmanProximalPolicyOptimization2017], due to the presence of a *replay buffer* used over the course of training.
Other than allowing the reuse of transitions $(s_t, a_t, r_t, s_{t+1})$ over the course of training, the replay buffer can also accommodate the injection of previously collected data into the training process [@ballEfficientOnlineReinforcement2023].
Using expert demonstrations to guide learning together with learned rewards, RL training can effectively be carried out in the real world [@luoSERLSoftwareSuite2025].
Interestingly, when complemented with in-training human interventions, real-world RL agents have been shown to learn policies with near-perfect success rates on challenging manipulation tasks in 1-2 hours [@luoPreciseDexterousRobotic2024].

#### Sample-efficient RL

In an MDP, the optimal policy $\pi^*$ can be derived from its associated Q-function, $Q_{\pi^*}$, and in particular the optimal action(s) $\mu(s_t)$ can be selected by maximizing the optimal Q-function over the action space,
$$\mu(s_t) = \arg\max_{a_t \in \mathcal A} Q_{\pi^*}(s_t, a_t).$$
Interestingly, the $Q^*$-function satisfies a recursive relationship (*Bellman equation*) based on a very natural intuition[^1]:

> [\...] If the optimal value $Q^*(s_{t+1}, a_{t+1})$ of the [state] $s_{t+1}$ was known for all possible actions $a_{t+1}$, then the optimal strategy is to select the action $a_{t+1}$ maximizing the expected value of $r_t + \gamma Q^*(s_{t+1}, a_{t+1})$
> $$Q^*(s_t, a_t) = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q^*(s_{t+1}, a_{t+1}) \big\vert s_t, a_t \right]$$

In turn, the optimal Q-function $Q^*$ is guaranteed to be self-consistent by definition.
*Value-iteration* methods exploit this relationship (and/or its state-value counterpart, $V^*(s_t)$) by iteratively updating an estimate $Q_i$ of $Q^*$, using the Bellman equation as the update rule (*Q-learning*):
$$Q_{i+1}(s_t, a_t) \leftarrow \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q_i(s_{t+1}, a_{t+1}) \big\vert s_t, a_t \right], \quad i=0,1,2,\dots,K-1$$
Then, one can derive the (ideally, near-optimal) policy by explicitly maximizing over the action space the final (ideally, near-optimal) estimate $Q_K \approx Q^*$ at each timestep.
In fact, under certain assumptions on the MDP considered, $Q_K \to Q^* \, \text{as } K \to \infty$.
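The Bellman update can be run to convergence on a small tabular problem. The sketch below assumes a hypothetical 3-state chain with known, deterministic dynamics (so the expectation over $s_{t+1}$ is trivial) and an infinite-horizon discounted setting; it is meant only to illustrate the update rule and the greedy policy extraction.

```python
# Hypothetical deterministic chain: step[s][a] = (next_state, reward).
# Action 0 stays in place, action 1 moves right; state 2 is absorbing.
step = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (1, 0.0), 1: (2, 1.0)},
    2: {0: (2, 0.0), 1: (2, 0.0)},
}
gamma = 0.9


def q_iteration(K=100):
    """Apply Q_{i+1}(s, a) <- r + gamma * max_a' Q_i(s', a') for K sweeps."""
    Q = {(s, a): 0.0 for s in step for a in step[s]}
    for _ in range(K):
        Q = {
            (s, a): r + gamma * max(Q[(s2, b)] for b in step[s2])
            for s in step
            for a in step[s]
            for (s2, r) in [step[s][a]]
        }
    return Q


Q = q_iteration()
# Greedy policy: mu(s) = argmax_a Q(s, a).
greedy = {s: max(step[s], key=lambda a: Q[(s, a)]) for s in step}
```

In this toy chain, the greedy policy moves right in states 0 and 1, which is the only way to collect the reward.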

Effective in its early applications to small-scale discrete problems and theoretically sound, vanilla Q-learning was found complicated to scale to large $\mathcal{S} \times \mathcal{A}$ problems, in which storing $Q : \mathcal{S} \times \mathcal{A} \mapsto \mathbb R$ alone might prove prohibitive.
Also, vanilla Q-learning is not directly usable for *continuous*, unstructured state-action space MDPs, such as those considered in robotics.
In their seminal work on *Deep Q-Learning* (DQN), @mnihPlayingAtariDeep2013 propose learning Q-values using deep convolutional neural networks, thereby accommodating large and even unstructured *state* spaces.
DQN parametrizes the Q-function using a neural network with parameters $\theta$, updating the parameters by sequentially minimizing the expected squared temporal-difference error (TD-error, $\delta_i$):

$$\begin{aligned}
\mathcal L(\theta_i) &= \mathbb E_{(s_t, a_t) \sim \chi(\bullet)} \big[ (\underbrace{y_i - Q_{\theta_i}(s_t, a_t)}_{\delta_i})^2 \big], \\
y_i &= \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q_{\theta_{i-1}}(s_{t+1}, a_{t+1}) \big],
\end{aligned}$$

where $\chi$ represents a behavior distribution over state-action pairs.
Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing the reuse of prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, the TD-error $\delta_i$ and the loss function $\mathcal L(\theta_i)$ via Monte-Carlo (MC) estimates.
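A Monte-Carlo estimate of the loss from replay-buffer transitions can be sketched as follows. For readability, the sketch assumes tabular Q-functions stored as dictionaries in place of neural networks, a tiny hand-written buffer, and no special handling of terminal states; all names are illustrative.

```python
import random


def td_loss(Q_current, Q_previous, batch, actions, gamma):
    """MC estimate of L(theta_i) over a batch of (s, a, r, s_next) transitions."""
    total = 0.0
    for s, a, r, s_next in batch:
        # TD-target y_i, computed with the previous parameters.
        y = r + gamma * max(Q_previous[(s_next, b)] for b in actions)
        # TD-error delta_i, computed with the current parameters.
        delta = y - Q_current[(s, a)]
        total += delta ** 2
    return total / len(batch)


actions = [0, 1]
# Hypothetical replay buffer of (s, a, r, s_next) transitions.
replay_buffer = [(0, 1, 0.0, 1), (1, 1, 1.0, 2), (0, 0, 0.0, 0), (1, 0, 0.0, 1)]
Q_prev = {(s, a): 0.0 for s in range(3) for a in actions}
Q_curr = dict(Q_prev)

rng = random.Random(0)
batch = rng.sample(replay_buffer, k=2)  # transitions are reused across updates
loss = td_loss(Q_curr, Q_prev, batch, actions, gamma=0.9)
full_loss = td_loss(Q_curr, Q_prev, replay_buffer, actions, gamma=0.9)
```

Because the batch is drawn from stored transitions rather than from the current policy, the same data can contribute to many parameter updates.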

While effective in handling large, unstructured state spaces for discrete action-space problems, DQN's application to continuous control problems proved challenging.
Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply infeasible due to (1) the continuous nature of the action space ($\mathcal{A} \subset \mathbb R^n$ for some $n$) and (2) the impossibility of finding a cheap (ideally, closed-form) maximizer of $Q_\theta$.
@silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$. Thus, policies can be iteratively refined by updating $\phi$ along the direction:
$$\begin{equation}
    d_\phi = \mathbb E_{s_t \sim \mathbb P(\bullet)} \left[ \nabla_\phi Q(s_t, a_t)\vert_{a_t = \mu_\phi(s_t)} \right] = \mathbb E_{s_t \sim \mathbb P(\bullet)} \left[ \nabla_{a_t} Q(s_t, a_t) \vert_{a_t = \mu_\phi(s_t)} \cdot \nabla_\phi \mu_\phi(s_t) \right]
\end{equation}$$
Provably, $d_\phi$ is the *deterministic policy gradient* (DPG) of the policy $\mu_\phi$ [@silverDeterministicPolicyGradient2014], so that updates $\phi_{k+1} \leftarrow \phi_k + \alpha d_\phi$ increase the (deterministic) cumulative discounted reward $J(\mu_\phi)$ for a sufficiently small step size $\alpha$.
@lillicrapContinuousControlDeep2019 extended DPG to the case of (1) high-dimensional unstructured observations and (2) continuous action spaces, introducing Deep Deterministic Policy Gradient (DDPG), an important algorithm in RL and its applications to robotics.
DDPG adopts a modified TD-target compared to the one used in DQN, maintaining a policy network used to select actions, yielding
$$\begin{equation}
    y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma Q_{\theta_{i-1}}(s_{t+1}, \mu_\phi(s_{t+1})) \big].
\end{equation}$$
Similarly to DQN, DDPG also employs a replay buffer, reusing past transitions over training for increased sample efficiency and estimating the loss function via MC estimates.
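The chain rule behind the deterministic policy gradient and the DDPG target can be illustrated on a scalar toy problem. The critic $Q(s, a) = -(a - 2s)^2$ and the linear policy $\mu_\phi(s) = \phi s$ are hypothetical choices made so that gradients are available in closed form; a real implementation would rely on learned networks and automatic differentiation.

```python
def grad_a_Q(s, a):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


def dpg_direction(phi, states):
    """d_phi = E_s[ dQ/da at a = mu_phi(s), times d mu_phi / d phi ], with mu_phi(s) = phi * s."""
    return sum(grad_a_Q(s, phi * s) * s for s in states) / len(states)


def ddpg_target(r, s_next, phi, Q_prev, gamma):
    """y = r + gamma * Q_prev(s', mu_phi(s')): the policy replaces the max over actions."""
    return r + gamma * Q_prev(s_next, phi * s_next)


states = [-1.0, -0.5, 0.5, 1.0]  # hypothetical batch of visited states
phi, alpha = 0.0, 0.1
for _ in range(200):
    phi += alpha * dpg_direction(phi, states)  # gradient ascent on phi

critic = lambda s, a: -((a - 2.0 * s) ** 2)
y = ddpg_target(r=1.0, s_next=0.5, phi=phi, Q_prev=critic, gamma=0.9)
```

For this critic the best action in state $s$ is $a = 2s$, so gradient ascent drives $\phi$ towards 2 without ever solving a maximization over actions.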

Soft Actor-Critic (SAC) [@haarnojaSoftActorCriticOffPolicy2018] is a derivation of DDPG in the max-entropy (MaxEnt) RL framework, in which RL agents are tasked with maximizing cumulative reward while also maximizing the entropy of their policy.
MaxEnt RL [@haarnojaReinforcementLearningDeep2017] has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularization formulation.
To this end, MaxEnt revisits the RL objective $J(\pi)$ to specifically account for the policy entropy,

$$J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} \left[ r_t + \alpha \mathcal H(\pi(\bullet \vert s_t)) \right].$$

This modified objective results in the *soft* TD-target:
$$\begin{equation}
    y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma \left( Q_{\theta_{i-1}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \vert s_{t+1}) \right) \right], \quad a_{t+1} \sim \pi_\phi(\bullet \vert s_{t+1})
\end{equation}$$
Similarly to DDPG, SAC also maintains an explicit policy, trained under the same MaxEnt framework to maximize $J(\pi)$, and updated using:
$$\begin{equation}
    \pi_{k+1} \leftarrow \arg\min_{\pi^\prime \in \Pi} \text{D}_{\text{KL}} \left(\pi^\prime(\bullet \vert s_t) \bigg\Vert \frac{\exp(Q_{\pi_k}(s_t, \bullet))}{Z_{\pi_k}(s_t)} \right)
\end{equation}$$
The update rule above optimizes the policy while projecting it onto a set $\Pi$ of tractable distributions (e.g., Gaussians, @haarnojaReinforcementLearningDeep2017).
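The soft TD-target and the target distribution of the KL projection can be sketched for a small discrete action set. This is an assumption made purely for illustration, since SAC targets continuous actions: with finitely many actions and an unrestricted $\Pi$, the KL minimizer is exactly the normalized $\exp(Q)$ distribution, which can be computed directly.

```python
import math


def soft_td_target(r, q_next, log_prob_next, gamma, alpha):
    """y = r + gamma * (Q(s', a') - alpha * log pi(a' | s')), for a sampled a'."""
    return r + gamma * (q_next - alpha * log_prob_next)


def boltzmann_policy(q_values):
    """pi(a | s) proportional to exp(Q(s, a)); Z is the normalizing constant."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(q - m) for q in q_values]
    Z = sum(weights)
    return [w / Z for w in weights]


def entropy(probs):
    """Shannon entropy H(pi(. | s)) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


pi_next = boltzmann_policy([1.0, 1.0, 0.0])  # hypothetical Q-values at s'
# Suppose the sampled next action has probability 0.5 and Q-value 2.0.
y_soft = soft_td_target(r=1.0, q_next=2.0, log_prob_next=math.log(0.5), gamma=0.9, alpha=0.2)
y_hard = 1.0 + 0.9 * 2.0  # the same target without the entropy bonus
```

Since $\log \pi \le 0$ for a discrete policy, the entropy term adds a non-negative bonus, so `y_soft` is at least as large as `y_hard`.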

#### Sample-efficient, data-driven RL

Importantly, sampling $(s_t, a_t, r_t, s_{t+1})$ from the replay buffer $D$ conveniently allows approximating the previously introduced expectations for TD-target and TD-error through Monte-Carlo (MC) estimates.
The replay buffer $D$ also proves extremely useful in maintaining a history of previous transitions and using it for training, improving sample efficiency.
Furthermore, it also naturally provides an entry point to inject offline trajectories recorded, for instance, by a human demonstrator, into the training process.

Reinforcement Learning with Prior Data (RLPD) [@ballEfficientOnlineReinforcement2023] is an Offline-to-Online RL algorithm leveraging prior data to effectively accelerate the training of a SAC agent.
Unlike previous works on Offline-to-Online RL, RLPD avoids any pre-training and instead uses the available offline data $D_\text\{offline\}$ to improve online learning from scratch.
During each training step, transitions from both the offline and online replay buffers are sampled in equal proportion, and used in the underlying SAC routine.
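
The equal-proportion sampling described above can be sketched as follows. This is a minimal illustration with plain Python lists standing in for the two replay buffers. The function name `sample_rlpd_batch`, the buffer contents, and the handling of odd batch sizes are assumptions of this example rather than details of the RLPD implementation.

```python
import random


def sample_rlpd_batch(online_buffer, offline_buffer, batch_size, rng=random):
    """Draw half of the batch from each buffer, uniformly with replacement.

    If batch_size is odd, the extra sample comes from the offline buffer
    (an arbitrary choice made for this sketch).
    """
    n_online = batch_size // 2
    n_offline = batch_size - n_online
    batch = [rng.choice(online_buffer) for _ in range(n_online)]
    batch += [rng.choice(offline_buffer) for _ in range(n_offline)]
    rng.shuffle(batch)  # avoid a fixed online/offline ordering in the batch
    return batch


# Toy transitions tagged with their origin, for illustration only.
rng = random.Random(0)
online = [("online", i) for i in range(10)]
offline = [("offline", i) for i in range(100)]

batch = sample_rlpd_batch(online, offline, batch_size=8, rng=rng)
n_from_online = sum(1 for source, _ in batch if source == "online")
print(n_from_online, len(batch) - n_from_online)
```

Note that the split is independent of the buffer sizes: even when the offline dataset is much larger than the online one, each source contributes half of every batch.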

#### Sample-efficient, data-driven, real-world RL

Despite the possibility of leveraging offline data for learning, the effectiveness of real-world RL training is still limited by the need to define a task-specific reward function, which is often hard to specify.
Further, even assuming access to a well-defined reward function, typical robotics pipelines rely mostly on proprioceptive inputs augmented by camera streams of the environment.
As such, even well-defined rewards would need to be derived from processed representations of unstructured observations, introducing brittleness.
In their technical report, @luoSERLSoftwareSuite2025 empirically address the needs (1) to define a reward function and (2) to use it on image observations, by introducing a series of tools that streamline the training of *reward classifiers* $c$ and jointly learn forward-backward controllers to speed up real-world RL.
Reward classifiers are particularly useful for complex tasks (e.g., folding a t-shirt) for which a precise reward formulation is arbitrarily complex to obtain, or which would require significant shaping. Such rewards are more easily learned directly from demonstrations of success ($e^+$) or failure ($e^-$) states $s \in \mathcal\{S\}$, with a natural choice for the state-conditioned reward function $r: \mathcal\{S\} \mapsto \mathbb\{R\}$ being $r(s) = \log c(e^+ \vert s)$.
Further, @luoSERLSoftwareSuite2025 demonstrate the benefits of learning *forward* (executing the task from initial state to completion) and *backward* (resetting the environment to the initial state from completion) controllers, parametrized by separate policies.
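
As a minimal sketch of the reward $r(s) = \log c(e^+ \vert s)$, assume the classifier is a binary model that outputs a single logit for the success class, so that $c(e^+ \vert s)$ is the sigmoid of that logit. This architecture, the function names, and the example logits are assumptions of this sketch, not details from the SERL report. Under that assumption the reward reduces to a log-sigmoid, which is best computed in a numerically stable way.

```python
import math


def log_sigmoid(logit):
    """Numerically stable log(sigmoid(logit))."""
    if logit >= 0:
        return -math.log1p(math.exp(-logit))
    return logit - math.log1p(math.exp(logit))


def classifier_reward(success_logit):
    """Reward r(s) = log c(e+ | s), given the classifier's success logit for s."""
    return log_sigmoid(success_logit)


# Hypothetical logits for three observations: failure, uncertain, success.
for logit in (-4.0, 0.0, 4.0):
    print(f"logit={logit:+.1f} -> reward={classifier_reward(logit):.4f}")
```

The resulting reward is never positive and approaches zero as the classifier becomes confident that the state is a success, so it behaves like a dense, learned proxy for task completion.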

Lastly, to improve the robustness of their approach to different goals while maintaining practical scalability, @luoSERLSoftwareSuite2025 introduced a modified state and action space, expressing proprioceptive configurations $q$ and actions $\dot q$ in the frame of the end-effector pose at $t=0$.
By randomizing the initial pose of the end-effector ($s_0$), @luoSERLSoftwareSuite2025 achieved a result similar to manually randomizing the environment at every timestep, while keeping the environment in the same condition across training episodes, which makes the method more practical and therefore more scalable.
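
The effect of expressing poses in the frame of the initial end-effector pose can be illustrated with a deliberately simplified planar example. The sketch assumes poses of the form $(x, y, \theta)$, whereas a real manipulator would use full 6-DoF poses. The function name `to_initial_frame` is hypothetical.

```python
import math


def to_initial_frame(pose, pose0):
    """Express a planar pose (x, y, theta) in the frame of the initial pose."""
    x, y, theta = pose
    x0, y0, theta0 = pose0
    dx, dy = x - x0, y - y0
    c, s = math.cos(theta0), math.sin(theta0)
    # Rotate the world-frame offset by -theta0 (i.e. apply R(theta0)^T).
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    # Wrap the relative angle to [-pi, pi).
    theta_rel = (theta - theta0 + math.pi) % (2 * math.pi) - math.pi
    return (x_rel, y_rel, theta_rel)


# Two episodes with different initial poses but the same motion:
# 10 cm forward along the end-effector's own x-axis.
start_a, now_a = (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)
start_b, now_b = (1.0, 2.0, math.pi / 2), (1.0, 2.1, math.pi / 2)

rel_a = to_initial_frame(now_a, start_a)
rel_b = to_initial_frame(now_b, start_b)
print(rel_a, rel_b)  # equal up to floating-point error
```

Because both episodes map to the same relative pose, a policy trained on such inputs sees identical observations regardless of where the episode started in the world frame.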

<ResponsiveImage src={Ch3HilSerlExamples} alt="(A) HIL-SERL allows for real-world training of high-performance RL agents by building on the advancements of SAC, RLPD and SERL. (B) Example of human intervention during a HIL-SERL training process on a SO-100." id="fig-fig:hil-serl-blocks" />

*(A) HIL-SERL allows for real-world training of high-performance RL agents by building on the advancements of SAC, RLPD and SERL. (B) Example of human intervention during a HIL-SERL training process on a SO-100.*

Building on off-policy deep Q-learning with replay buffers, entropy regularization for better exploration and performance, expert demonstrations to guide learning, and a series of tools and recommendations for real-world training using reward classifiers (Figure Section fig:hil-serl-blocks), @luoPreciseDexterousRobotic2024 introduce human interactions during training, learning near-optimal policies in challenging real-world manipulation tasks in 1-2 hours.

Human in the Loop Sample Efficient Robot reinforcement Learning (HIL-SERL) [@luoPreciseDexterousRobotic2024] augments offline-to-online RL with targeted human corrections during training, and employs prior data to (1) train a reward classifier and (2) bootstrap RL training on expert trajectories.
While demonstrations provide the initial dataset that seeds learning and constrains early exploration, interactive corrections allow a human supervisor to intervene on failure modes and supply targeted corrections that aid the learning process.
Crucially, human interventions are stored in both the offline and online replay buffers, unlike the autonomous transitions generated at training time, which are stored in the online buffer only.
Consequently, given an intervention timestep $k \in (0, T)$, length-$K$ human intervention data $\lbrace (s^\{\text\{human\}\}_k, a^\{\text\{human\}\}_k, r^\{\text\{human\}\}_k, s^\{\text\{human\}\}_\{k+1\}) \rbrace_\{k=1\}^K$ is more likely to be sampled for off-policy learning than the data generated online during training, providing stronger supervision to the agent while still allowing for autonomous learning.
Empirically, HIL-SERL attains near-perfect success rates on diverse manipulation tasks within 1-2 hours of training [@luoPreciseDexterousRobotic2024], underscoring how combining offline datasets with online RL can markedly improve stability and data efficiency, and ultimately make real-world RL training feasible.
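
To see why intervention data is sampled more often, the sketch below routes transitions to the two buffers and computes the per-draw sampling probability of a single transition. It assumes that each draw picks the online or offline buffer with probability one half and then a transition uniformly within it, mirroring the equal-proportion scheme described for RLPD. The function names and buffer contents are hypothetical.

```python
def store_transition(transition, is_intervention, online_buffer, offline_buffer):
    """Autonomous transitions go to the online buffer only.

    Human interventions are stored in both buffers.
    """
    online_buffer.append(transition)
    if is_intervention:
        offline_buffer.append(transition)


def per_draw_probability(in_online, in_offline, n_online, n_offline):
    """Probability that one specific transition is picked in a single draw,
    assuming a 50/50 choice of buffer followed by uniform sampling."""
    p = 0.0
    if in_online:
        p += 0.5 / n_online
    if in_offline:
        p += 0.5 / n_offline
    return p


# The offline buffer starts with two (hypothetical) demonstration transitions.
offline_buffer = ["demo_0", "demo_1"]
online_buffer = []
for i in range(3):
    store_transition(f"auto_{i}", False, online_buffer, offline_buffer)
store_transition("human_0", True, online_buffer, offline_buffer)

p_auto = per_draw_probability(True, False, len(online_buffer), len(offline_buffer))
p_human = per_draw_probability(True, True, len(online_buffer), len(offline_buffer))
print(f"autonomous: {p_auto:.4f}, intervention: {p_human:.4f}")
```

In this toy setting the intervention transition is more than twice as likely to be drawn as an autonomous one, which is the mechanism behind the stronger supervision described above.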

### Code Example: Real-world RL

**TODO(fracapuano): work out rl training example**

### Limitations of RL in Real-World Robotics: Simulators and Reward Design

Despite the advancements in real-world RL training, training RL agents for robotics in the real world still suffers from the following limitations:

- In those instances where real-world training experience is prohibitively expensive to gather [@degraveMagneticControlTokamak2022; @bellemareAutonomousNavigationStratospheric2020], in-simulation training is often the only option. However, high-fidelity simulators for real-world problems can be difficult to build and maintain, especially for contact-rich manipulation and tasks involving deformable or soft materials.

- Reward design poses an additional source of brittleness. Dense shaping terms are often required to guide exploration in long-horizon problems, but poorly tuned terms can lead to specification gaming or local optima. Sparse rewards avoid shaping but exacerbate credit assignment and slow down learning. In practice, complex behaviors require substantial reward-shaping effort, which is a brittle and error-prone process.

Advances in Behavioral Cloning (BC) from corpora of human demonstrations address both of these concerns.
By learning in a supervised fashion to reproduce expert demonstrations, BC methods prove competitive while bypassing the need for simulated environments and hard-to-define reward functions.

[^1]: Quote from @mnihPlayingAtariDeep2013. The notation has been slightly adapted for consistency with the rest of this tutorial.