diff --git "a/app/scripts/latex-to-markdown/output/main.mdx" "b/app/scripts/latex-to-markdown/output/main.mdx" --- "a/app/scripts/latex-to-markdown/output/main.mdx" +++ "b/app/scripts/latex-to-markdown/output/main.mdx" @@ -1,7 +1,21 @@ --- -title: "Research Article" -description: "Converted from LaTeX to MDX" -date: "2025-09-18" +title: "Robot Learning: A Tutorial" +authors: + - name: "Francesco Capuano" + affiliations: [1, 2] + - name: "Adil Zouitine" + affiliations: [2] + - name: "Pepijn Kooijmans" + affiliations: [2] + - name: "Thomas Wolf" + affiliations: [2] + - name: "Michel Aractingi" + affiliations: [2] +affiliations: + - name: "École Normale Supérieure Paris-Saclay" + - name: "Hugging Face" +published: "Sep 18, 2025" +tableOfContentsAutoCollapse: true --- import ResponsiveImage from '../components/ResponsiveImage.astro'; @@ -47,7 +61,7 @@ import ch4_queues from '../assets/image/figures/ch4/ch4-queues.png'; import ch5_pi0_sampling_timesteps from '../assets/image/figures/ch5/ch5-pi0-sampling-timesteps.png'; -# Foreword +## Foreword Robotics is an inherently multidisciplinary field, and is now witnessing advancements unprecedented since its inception in the 1960s. Yet, more than sixty years after the debut of Unimate, robots have still not fully integrated into the rich, unstructured, and dynamic world we humans inhabit. Over the decades, numerous disciplines have shown immense promise in tackling the challenges of creating autonomous systems. This tutorial takes a clear stance in the debate on whether modern Machine Learning can play a pivotal role in the development of autonomous robot systems: we believe this to be the case. @@ -65,7 +79,7 @@ Instead, our goal here is to provide an intuitive explanation as to why these d We sincerely hope this tutorial serves as a valuable starting point for your journey into robot learning. -# Introduction +## Introduction -# Classical Robotics +## Classical Robotics
@@ -206,7 +220,7 @@ TL;DR Learning-based approaches to robotics are motivated by the need to (1) gen
-## Explicit and Implicit Models +### Explicit and Implicit Models dynamics-based[^1] methods, leveraging precise descriptions of the mechanics of robots’ rigid bodies and their interactions with any obstacles in the environment--to *implicit* models--learning-based methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings @agrawalComputationalSensorimotorLearning, @bekrisStateRobotMotion2024. A variety of methods have been developed between these two extrema. For instance, @hansenTemporalDifferenceLearning2022 show how learning-based systems can benefit from information on the physics of problems, complementing a traditional learning method such as Temporal Difference (TD)-learning @suttonReinforcementLearningIntroduction2018 with Model-Predictive Control (MPC). Conversely, as explicit models may rely on assumptions that prove overly simplistic--or even unrealistic--in practice, learning can prove effective in improving the modeling of complex phenomena or in complementing perception @mccormacSemanticFusionDense3D2016. Such examples demonstrate the richness of approaches to robotics, and Figure 2 graphically illustrates some of the most relevant techniques. Such a list is clearly far from exhaustive, and we refer to @bekrisStateRobotMotion2024 for a more comprehensive overview of both general and application-specific methods for motion generation. In this section, we wish to introduce the inherent benefits of learning-based approaches to robotics--the core focus of this tutorial. -## Different Types of Motion +### Different Types of Motion Considering the (toy) example presented in Figure 6, we can analytically write the end-effector’s position $p \in \mathbb R^2$ as a function of the robot’s configuration, $p = p(q), p: \mathcal Q \mapsto \mathbb R^2$. 
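This configuration-to-position map can also be sketched numerically. Below is a minimal illustration (assuming unit link lengths $l_1 = l_2 = 1$; the helper names are ours, not from an existing codebase): it implements the two-link forward map and recovers a configuration for a target position $p^*$ by gradient descent on $\Vert p(q) - p^* \Vert_2^2$, the iterative strategy discussed in this section.

```python
import math

L = 1.0  # both links assumed to have unit length, as in the toy example

def forward_kinematics(q):
    """FK for the planar 2-link arm: configuration (theta1, theta2) -> end-effector (x, y)."""
    t1, t2 = q
    return (L * math.cos(t1) + L * math.cos(t1 + t2),
            L * math.sin(t1) + L * math.sin(t1 + t2))

def ik_loss(q, target):
    """Squared distance between the reached position p(q) and the target p*."""
    x, y = forward_kinematics(q)
    return (x - target[0]) ** 2 + (y - target[1]) ** 2

def inverse_kinematics(target, q0=(0.1, 0.1), lr=0.05, steps=3000, eps=1e-6):
    """Iterative IK: minimize ||p(q) - p*||^2 by finite-difference gradient descent."""
    q = list(q0)
    for _ in range(steps):
        grad = []
        for i in range(2):
            q_hi, q_lo = list(q), list(q)
            q_hi[i] += eps
            q_lo[i] -= eps
            grad.append((ik_loss(q_hi, target) - ik_loss(q_lo, target)) / (2 * eps))
        q = [qi - lr * gi for qi, gi in zip(q, grad)]
    return tuple(q)

target = (1.0, 1.0)  # reachable, since ||p*||_2 <= 2L
q_star = inverse_kinematics(target)
```

Note that the solution is not unique (elbow-up and elbow-down configurations both reach the target); gradient descent simply converges to one of them depending on the initialization.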
In particular, we have: + $$ -`p(q) = -\begin -p_x(\theta_1, \theta_2)\\ - p_y(\theta_1, \theta_2) -\end{pmatrix} -= -\begin{pmatrix} -l \cos(\theta_1) + l \cos(\theta_1 + \theta_2)\\ - l \sin(\theta_1) + l \sin(\theta_1 + \theta_2) -\end{pmatrix} -\in S^{n=2}_{l_1+l_2} = \{ p(q) \in \mathbb R^2: \Vert p(q) \Vert_2^2 \leq (2l)^2, \ \forall q \in \mathcal Q \}` +`p(q) = \begin{pmatrix} p_x(\theta_1, \theta_2)\\ p_y(\theta_1, \theta_2) \end{pmatrix} = \begin{pmatrix} l \cos(\theta_1) + l \cos(\theta_1 + \theta_2)\\ l \sin(\theta_1) + l \sin(\theta_1 + \theta_2) \end{pmatrix} \in S^{n=2}_{l_1+l_2} = \{ p(q) \in \mathbb R^2: \Vert p(q) \Vert_2^2 \leq (2l)^2, \ \forall q \in \mathcal Q \}` $$ + Deriving the end-effector’s *pose*--position *and* orientation--in some $m$-dimensional space $\boldsymbol{p} \in \mathcal{P} \subset \mathbb{R}^{m}$ starting from the configuration ${\textnormal{q}}\in \mathcal Q \subset \mathbb R^n$ of a $n$-joints robot is referred to as *forward kinematics* (FK), whereas identifying the configuration corresponding to any given target pose is termed *inverse kinematics* (IK). In that, FK is used to map a robot configuration into the corresponding end-effector pose, whereas IK is used to reconstruct the configuration(s) given an end-effector pose. In the simplified case here considered (for which $\boldsymbol{p} \equiv p$, as the orientation of the end-effector is disregarded for simplicity), one can solve the problem of controlling the end-effector’s location to reach a goal position $p^*$ by solving analytically for $q: p(q) = f_{\text{FK}}(q) = p^*$. However, in the general case, one might not be able to solve this problem analytically, and can typically resort to iterative optimization methods comparing candidate solutions using a loss function (in the simplest case, $\Vert p(q) - p^* \Vert_2^2$ is a natural candidate), yielding: @@ -339,7 +345,7 @@ Unlike eq. 
10) motivate the exploration of learning-based approaches that can (1) integrate perception and control more tightly, (2) adapt across tasks and embodiments with reduced expert modeling interventions and (3) scale gracefully in performance as more robotics data becomes available. -# Robot (Reinforcement) Learning +## Robot (Reinforcement) Learning
@@ -442,7 +448,7 @@ Figure 13 depicts two such cases. Reaching for an object to move somewhere else in the scene is indeed a sequential problem where at each cycle the controller needs to adjust the position of the robotic arm based on its current configuration and the (possibly varying) position of the object. Figure 13 also shows an example of a locomotion problem, where sequentiality is inherent in the problem formulation. While sliding to the side, the controller has to constantly keep adjusting to the robot’s proprioception to avoid failure (falling). -## A (Concise) Introduction to RL +### A (Concise) Introduction to RL The RL framework @suttonReinforcementLearningIntroduction2018, which we briefly introduce here, has often been used to model robotics problems @koberReinforcementLearningRobotics. RL is a subfield within ML fundamentally concerned with the development of autonomous systems (*agents*) that learn how to *continuously behave* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*). Crucially for robotics, RL agents can improve via trial-and-error alone, thus entirely bypassing the need to develop explicit models of the problem dynamics, and instead exploiting interaction data only. In RL, this feedback loop (Figure 14) between actions and outcomes is established through the agent sensing a scalar quantity (*reward*). 
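The feedback loop just described fits in a few lines of code. Below is a deliberately toy sketch (the environment, its dynamics, and the hand-written policy are all invented for illustration): the agent acts, the environment transitions and emits a scalar reward, and the agent acts again on the new state.

```python
class ToyEnv:
    """A made-up 1-D environment: the state is a point the agent nudges toward the origin."""

    def reset(self):
        self.state = 1.0
        return self.state

    def step(self, action):
        # environment dynamics: the action displaces the state
        self.state += action
        # scalar reward: the closer to the origin, the better
        reward = -abs(self.state)
        return self.state, reward


def policy(state):
    # a hand-crafted (not learned) policy: move halfway back toward the origin
    return -0.5 * state


env = ToyEnv()
state = env.reset()
rewards = []
for _ in range(5):  # the agent-environment feedback loop
    action = policy(state)
    state, reward = env.step(action)
    rewards.append(reward)
```

An RL agent would replace the hand-crafted `policy` with one improved from the observed rewards; the structure of the loop stays the same.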
@@ -478,7 +484,12 @@ A length-$T$ *trajectory* is the (random) sequence \end{equation} ``` with per-step rewards defined as $r_t = r (s_t, a_t, s_{t+1})$ for ease of notation. Interestingly, assuming both the environment dynamics and the conditional distribution over actions given states--the *policy*--to be *Markovian*: -$$\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t) $$The probability of observing a given trajectory$\tau$ factorizes into + +$$ +`\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, \dots s_0, a_0) = \mathbb P(a_t\vert s_t) ` +$$ + + The probability of observing a given trajectory $\tau$ factorizes into ``` math \begin{equation} @@ -486,13 +497,18 @@ $$\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb \end{equation} ``` -Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained optimizing the (discounted) *return* associated to a given $\tau$, i.e. the (random) sum of measured rewards over trajectory: ``` math G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t. ``` In that, agents seek to learn control strategies (*policies*,$\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. 
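Under these Markov assumptions, the probability of a trajectory is just a product of initial-state, policy, and dynamics terms. A small tabular sketch (the two-state MDP below is invented purely for illustration):

```python
# a made-up two-state, two-action MDP, used only to illustrate the factorization
rho = {"s0": 1.0}  # initial-state distribution
policy = {("s0", "a0"): 0.8, ("s0", "a1"): 0.2,
          ("s1", "a0"): 0.5, ("s1", "a1"): 0.5}
dynamics = {("s0", "a0", "s1"): 0.9, ("s0", "a0", "s0"): 0.1,
            ("s0", "a1", "s0"): 1.0,
            ("s1", "a0", "s1"): 1.0, ("s1", "a1", "s0"): 1.0}

def trajectory_probability(traj):
    """P(tau) = rho(s_0) * prod_t pi(a_t|s_t) * D(s_{t+1}|s_t, a_t)."""
    states, actions = traj
    p = rho.get(states[0], 0.0)
    for t, a in enumerate(actions):
        p *= policy[(states[t], a)]                       # policy term
        p *= dynamics.get((states[t], a, states[t + 1]), 0.0)  # dynamics term
    return p

p_tau = trajectory_probability((["s0", "s1", "s1"], ["a0", "a0"]))
p_impossible = trajectory_probability((["s0", "s0", "s1"], ["a1", "a1"]))
```

Trajectories containing a transition the dynamics cannot produce get probability zero, as expected.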
For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$ +Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained by optimizing the (discounted) *return* associated with a given $\tau$, i.e. the (random) sum of measured rewards over the trajectory: +``` math +G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t. +``` +In that, agents seek to learn control strategies (*policies*, $\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$ + $$ -`J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\ - \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).` +`J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\ \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).` $$ + Because in the RL framework the agent is assumed to only be able to observe the environment dynamics and not to intervene on them, [eq:RL-j-function] varies exclusively with the policy followed. 
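The return of a recorded trajectory is straightforward to compute; a minimal sketch (the reward values are made up):

```python
def discounted_return(rewards, gamma):
    """G(tau) = sum_{t=0}^{T-1} gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)            # 1 + 0.5 + 0.25
G_undiscounted = discounted_return([1.0, 1.0, 1.0], gamma=1.0)
```

With $\gamma < 1$, early rewards weigh more than late ones; estimating $J(\pi_\theta)$ then amounts to averaging `discounted_return` over trajectories sampled by acting with $\pi_\theta$.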
In turn, MDPs naturally provide a framework to optimize over the space of the possible behaviors an agent might enact ($\pi \in \Pi$), searching for the *optimal policy* $\pi^* = \arg \max_{\theta} J(\pi_\theta)$, where $\theta$ is the parametrization adopted by the policy set $\Pi: \pi_\theta \in \Pi, \ \forall \theta$. Other than providing a target for policy search, $G(\tau)$ can also be used as a target to discriminate between states and state-action pairs. Given any state $s \in \mathcal S$--e.g., a given configuration of the robot--the *state-value* function ``` math V_\pi(s) = \mathbb E_{\tau \sim \pi} [G(\tau) \big \vert s_0 = s] @@ -502,7 +518,12 @@ can be used to discriminate between desirable and undesirable states in terms of Q_\pi(s,a) = \mathbb E_{\tau \sim \pi} [G (\tau) \big \vert s_0 = s, a_0=a] ``` Crucially, value functions are interrelated: -$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)] $$Inducing an ordering over states and state-action pairs under$\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attempts to find (approximate) solutions to the problem of maximizing cumulative reward (Figure 15). + +$$ +`Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)] ` +$$ + + Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attempts to find (approximate) solutions to the problem of maximizing cumulative reward (Figure 15). [eq:dqn-loss] via Monte-Carlo (MC) estimates. + + Where $\chi$ represents a behavior distribution over state-action pairs. 
Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing the reuse of prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function [eq:dqn-loss] via Monte-Carlo (MC) estimates. While effective in handling large, unstructured state spaces for discrete action-space problems, DQN’s application to continuous control problems proved challenging. Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply infeasible due to (1) the continuous nature of the action space ($\mathcal A\subset \mathbb R^n$ for some $n$) and (2) the impossibility of finding a cheap (ideally, closed-form) maximizer of $Q_\theta$. @silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$. Thus, policies can be iteratively refined by updating $\phi$ along the direction: ``` math @@ -618,13 +638,13 @@ Similarly to DDPG, SAC also maintains an explicit policy, trained under the sam ``` The update rule provided in [eq:sac-policy-update] optimizes the policy while projecting it on a set $\Pi$ of tractable distributions (e.g., Gaussians, @haarnojaReinforcementLearningDeep2017). -#### Sample-efficient, data-driven RL +##### Sample-efficient, data-driven RL Importantly, sampling $(s_t, a_t, r_t, s_{t+1})$ from the replay buffer $D$ conveniently allows approximating the previously introduced expectations for the TD-target and TD-error through Monte-Carlo (MC) estimates. The replay buffer $D$ also proves extremely useful in maintaining a history of previous transitions and using it for training, improving on sample efficiency. 
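The replay-buffer mechanics described here fit in a few lines. Below is a sketch of forming DQN-style TD-targets and TD-errors from stored transitions (the tabular `Q` stands in for a neural $Q_\theta$; all numbers are invented):

```python
gamma = 0.99
actions = ("left", "right")

# tabular stand-in for a learned Q-function Q_theta
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5,
     ("s1", "left"): 0.1, ("s1", "right"): 0.7}

# replay buffer of (s_t, a_t, r_t, s_{t+1}, done) transitions, possibly generated
# by a behavior distribution different from the current policy (off-policy reuse)
replay_buffer = [
    ("s0", "right", 1.0, "s1", False),
    ("s1", "left", 0.0, "s0", False),
    ("s1", "right", 2.0, "s1", True),
]

def td_target(s, a, r, s_next, done):
    """y = r_t + gamma * max_a' Q(s_{t+1}, a'); no bootstrapping past terminal states."""
    bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    return r + gamma * bootstrap

targets = [td_target(*tr) for tr in replay_buffer]
td_errors = [y - Q[(s, a)] for y, (s, a, *_rest) in zip(targets, replay_buffer)]
```

A DQN update would then regress $Q_\theta(s_t, a_t)$ toward these targets over minibatches sampled from the buffer; the `max` over actions is exactly the step that becomes intractable for continuous $\mathcal A$.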
Furthermore, it also naturally provides an entry point to inject offline trajectories recorded, for instance, by a human demonstrator, into the training process. Reinforcement Learning with Prior Data (RLPD) @ballEfficientOnlineReinforcement2023 is an Offline-to-Online RL algorithm leveraging prior data to effectively accelerate the training of a SAC agent. Unlike previous works on Offline-to-Online RL, RLPD avoids any pre-training and instead uses the available offline data $D_\text{offline}$ to improve online learning from scratch. During each training step, transitions from both the offline and online replay buffers are sampled in equal proportion, and used in the underlying SAC routine. -#### Sample-efficient, data-driven, real-world RL +##### Sample-efficient, data-driven, real-world RL Despite the possibility of leveraging offline data for learning, the effectiveness of real-world RL training is still limited by the need for a task-specific reward function, which is often hard to define. Further, even assuming access to a well-defined reward function, typical robotics pipelines rely mostly on proprioceptive inputs augmented by camera streams of the environment. As such, even well-defined rewards would need to be derived from processed representations of unstructured observations, introducing brittleness. In their technical report, @luoSERLSoftwareSuite2025 empirically address the needs (1) to define a reward function and (2) to use it on image observations, by introducing a series of tools allowing for the streamlined training of *reward classifiers* $c$, as well as for jointly learning forward-backward controllers to speed up real-world RL. 
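The RLPD sampling scheme described above reduces to drawing half of every batch from each buffer. A minimal sketch (buffer contents are placeholders):

```python
import random

random.seed(0)  # for reproducibility of the sketch

# placeholder transitions; in practice these hold (s, a, r, s') tuples
offline_buffer = [("offline", i) for i in range(100)]  # e.g., prior demonstrations
online_buffer = [("online", i) for i in range(100)]    # collected during training

def symmetric_sample(batch_size):
    """RLPD-style batch: equal proportions of prior (offline) and fresh (online) data."""
    half = batch_size // 2
    return random.sample(offline_buffer, half) + random.sample(online_buffer, half)

batch = symmetric_sample(64)  # fed to the underlying SAC update
```

Because no pre-training phase is required, the same sampling routine is used from the very first gradient step.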
Reward classifiers are particularly useful for complex tasks--e.g., folding a t-shirt--for which a precise reward formulation is arbitrarily complex to obtain, or which require significant shaping and are more easily learned directly from demonstrations of success ($e^+$) or failure ($e^-$) states, $s \in \mathcal S$, with a natural choice for the state-conditioned reward function $r: \mathcal S \mapsto \mathbb R$ being $r(s) = \log c(e^+ \vert s)$. Further, @luoSERLSoftwareSuite2025 demonstrate the benefits of learning *forward* (executing the task from initial state to completion) and *backward* (resetting the environment to the initial state from completion) controllers, parametrized by separate policies. @@ -644,11 +664,11 @@ Building on off-policy deep Q-learning with replay buffers, entropy regularizati Human in the Loop Sample Efficient Robot reinforcement Learning (HIL-SERL) @luoPreciseDexterousRobotic2024 augments offline-to-online RL with targeted human corrections during training, and employs prior data to (1) train a reward classifier and (2) bootstrap RL training on expert trajectories. While demonstrations provide the initial dataset seeding learning and constraining early exploration, interactive corrections allow a human supervisor to intervene on failure modes and supply targeted interventions to aid the learning process. Crucially, human interventions are stored in both the offline and online replay buffers, unlike the autonomous transitions generated at training time, which are stored in the online buffer only. Consequently, given an intervention timestep $k \in (0, T)$, length-$K$ human intervention data $\{ s^{\text{human}}_k, a^{\text{human}}_k, r^{\text{human}}_k, s^{\text{human}}_{k+1}\}_{k=1}^K$ is more likely to be sampled for off-policy learning than the data generated online during training, providing stronger supervision to the agent while still allowing for autonomous learning. 
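The intervention bookkeeping just described amounts to routing each transition by its origin. A sketch (buffer names and transitions are illustrative):

```python
online_buffer, offline_buffer = [], []

def store(transition, human_intervention):
    """Autonomous transitions go to the online buffer only; human interventions go
    to both buffers, so the off-policy learner samples them more often."""
    online_buffer.append(transition)
    if human_intervention:
        offline_buffer.append(transition)

# a short rollout with one human correction at the second step
store(("s0", "a0", 0.0, "s1"), human_intervention=False)
store(("s1", "a_human", 1.0, "s2"), human_intervention=True)
store(("s2", "a2", 0.0, "s3"), human_intervention=False)
```

Sampling both buffers in equal proportion (as in RLPD) then over-represents the corrected transitions relative to autonomous ones.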
Empirically, HIL-SERL attains near-perfect success rates on diverse manipulation tasks within 1-2 hours of training @luoPreciseDexterousRobotic2024, underscoring how combining offline datasets with online RL can markedly improve stability and data efficiency, and ultimately even allow for real-world RL training. -### Code Example: Real-world RL +#### Code Example: Real-world RL **TODO(fracapuano): work out rl training example** -### Limitations of RL in Real-World Robotics: Simulators and Reward Design +#### Limitations of RL in Real-World Robotics: Simulators and Reward Design Despite the advancements in real-world RL training, solving robotics tasks by training RL agents in the real world still suffers from the following limitations: @@ -658,7 +678,7 @@ Despite the advancements in real-world RL training, solving robotics tasks by tr Advances in Behavioral Cloning (BC) from corpora of human demonstrations address both of these concerns. By learning in a supervised fashion to reproduce expert demonstrations, BC methods prove competitive while bypassing the need for simulated environments and hard-to-define reward functions. -# Robot (Imitation) Learning +## Robot (Imitation) Learning
@@ -721,11 +741,11 @@ Despite the inherent challenges of learning on non-i.i.d. data, the BC formulati While conceptually elegant, point-estimate policies $f : \mathcal O\mapsto \mathcal A$ learned by solving [eq:loss-minimization-SL] have been observed to suffer from (1) compounding errors @rossReductionImitationLearning2011 and (2) poor fit to multimodal distributions @florenceImplicitBehavioralCloning2022, @keGraspingChopsticksCombating2020. Figure 21 illustrates these two key issues related to learning *explicit policies* @florenceImplicitBehavioralCloning2022. Besides sequentiality in $\mathcal D$, compounding errors due to *covariate shift* may also prove catastrophic, as even small $\epsilon$-prediction errors $0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon$ can quickly drive the policy into out-of-distribution states, incurring less confident generations and thus compounding errors (Figure 21, left). Moreover, point-estimate policies typically fail to learn *multimodal* targets, which are very common in human demonstrations solving robotics problems, since multiple trajectories can be equally good towards the accomplishment of a goal (e.g., symmetric grasps, Figure 21, right). In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands @florenceImplicitBehavioralCloning2022. To address poor multimodal fitting, @florenceImplicitBehavioralCloning2022 propose learning the generative model $p(o, a)$ underlying the samples in $\mathcal D$, rather than explicitly learning a prediction function $f(o) = a$. -## A (Concise) Introduction to Generative Models +### A (Concise) Introduction to Generative Models Generative Models (GMs) aim to learn the stochastic process underlying the very generation of the data collected, and typically do so by fitting a probability distribution that approximates the unknown *data distribution*, $p$. 
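Before turning to generative models over $(o,a)$, the mode-averaging failure above is easy to reproduce: for a symmetric grasp, the MSE-optimal point estimate is the mean of the demonstrated actions, which matches none of them. A toy sketch with invented action values:

```python
# expert actions for the same observation: grasp from the left (-1.0) or right (+1.0)
expert_actions = [-1.0, -1.0, 1.0, 1.0]

# the MSE-optimal point estimate is the sample mean of the targets...
point_estimate = sum(expert_actions) / len(expert_actions)

# ...which lies between the two modes, far from every demonstrated action
distance_to_nearest_mode = min(abs(point_estimate - a) for a in expert_actions)
```

A generative model of $p(o, a)$ can instead place probability mass on both modes and sample a committed action from one of them.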
In the case of BC, this unknown data distribution $p$ represents the expert’s joint distribution over $(o, a)$-pairs. Thus, given a finite set of $N$ pairs $\mathcal D = \{ (o,a)_i \}_{i=0}^N$ used as an imitation learning target (and thus assumed to be i.i.d.), GM seeks to learn a *parametric* distribution $p_\theta(o,a)$ such that (1) new samples $(o,a) \sim p_\theta(\bullet)$ resemble those stored in $\mathcal D$, and (2) high likelihood is assigned to the observed regions of the unobservable $p$. Likelihood-based learning provides a principled training objective to achieve both objectives, and it is thus extensively used in GM @prince2023understanding. -### Variational Auto-Encoders +#### Variational Auto-Encoders Given a dataset $\mathcal D$ consisting of $N$ i.i.d. observation-action pairs, the log-likelihood of all datapoints under $\theta$ (in Bayesian terms, the *evidence* $p_\theta(\mathcal D)$) can thus be written as: + $$ -`\log p_\theta(\mathcal D) = \log \sum_{i=0}^N p_\theta ((o,a)_i)\\ - = \log \sum_{i=0}^N \int_{\text{supp}({Z})} p_\theta((o,a)_i \vert z) p(z)\\ - = \log \sum_{i=0}^N \int_{\text{supp}({Z})} \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z) p(z)\\ - = \log \sum_{i=0}^N \mathbb E_{z \sim p_\theta(\bullet \vert (o,a)_i)} [\frac{p(z)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z)], ` +`\log p_\theta(\mathcal D) = \log \sum_{i=0}^N p_\theta ((o,a)_i)\\ = \log \sum_{i=0}^N \int_{\text{supp}({Z})} p_\theta((o,a)_i \vert z) p(z)\\ = \log \sum_{i=0}^N \int_{\text{supp}({Z})} \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z) p(z)\\ = \log \sum_{i=0}^N \mathbb E_{z \sim p_\theta(\bullet \vert (o,a)_i)} [\frac{p(z)}{q_\theta(z \vert (o,a)_i)} \cdot p_\theta((o,a)_i \vert z)], ` $$ + where we used [eq:BC-latent-variable] in [eq:evidence-definition-1], multiplied by $1 = \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)}$ in 
[eq:evidence-definition-2], and used the definition of expected value in [eq:evidence-definition]. In the special case where one assumes distributions to be tractable, $p_\theta (\mathcal D)$ is typically tractable too, and $\max_\theta \log p_\theta(\mathcal D)$ provides a natural target for (point-wise) inferring the unknown parameters $\theta$ of the generative model. Unfortunately, [eq:evidence-definition] is rarely tractable when the distribution $p$ is modeled with approximators such as neural networks, especially for high-dimensional, unstructured data. @@ -770,31 +789,23 @@ In the special case where one assumes distributions to be tractable, $p_\theta ( In their seminal work on Variational Auto-Encoders (VAEs), @kingmaAutoEncodingVariationalBayes2022 present two major contributions to learning complex latent-variable GMs on unstructured data, proposing (1) a tractable, variational lower-bound to [eq:evidence-definition] as an optimization target to jointly learn likelihood and posterior and (2) high-capacity function approximators to model the likelihood $p_\theta(o,a\vert z)$ and (approximate) posterior distribution $q_\phi(z \vert o,a) \approx q_\theta(z \vert o,a)$. 
In particular, the lower bound on [eq:evidence-definition] (Evidence LOwer Bound, *ELBO*) can be derived from [eq:evidence-definition] applying Jensen’s inequality--$\log \mathbb{E}[\bullet] \geq \mathbb{E} [\log (\bullet)]$--yielding: + +$$ +`\log p_\theta(\mathcal D) \geq \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] + \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} [\log \left( \frac{p(z)}{q_\theta(z \vert (o,a)_i)} \right)] \right)\\ = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - \text{D}_{\text{KL}}\big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big] \right) ` $$ -`\log p_\theta(\mathcal D) \geq \sum_{i=0}^{N} \left( - \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - + \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} [\log \left( \frac{p(z)}{q_\theta(z \vert (o,a)_i)} \right)] - \right)\\ - = \sum_{i=0}^{N} \left( - \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - - \text{D}_{\text{KL}}\big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big] - \right) ` -$$The true, generally intractable posterior$p_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in [eq:ELBO-intractable], and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective, $\text{ELBO}_{\mathcal D}(\theta, \phi) = \sum_{i=0}^{N} \left( - \mathbb{E}_{z \sim q_\phi(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - - \text{D}_{\text{KL}}\big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big] - \right) - $ From Jensen’s inequality, maximizing ELBO results in maximizing the log-likelihood of the data too, thus providing a natural, tractable optimization target. 
Indeed, expectations can be estimated using MC estimates from the learned distributions in [eq:ELBO], while the KL-divergence term can typically be computed in closed-form by (1) modeling $q_\phi$ as a Gaussian $q_\phi(z \vert o,a) = \mathcal N\big(\mu_\phi(o,a), \Sigma_\phi(o,a) \big)$ and (2) imposing a standard Gaussian prior on the latent space, $p(z) = \mathcal N(\mathbf{0}, \mathbf{I})$. + + The true, generally intractable posterior $p_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in [eq:ELBO-intractable], and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective, $\text{ELBO}_{\mathcal D}(\theta, \phi) = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim q_\phi(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - \text{D}_{\text{KL}}\big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big] \right)$. From Jensen’s inequality, maximizing the ELBO also maximizes the log-likelihood of the data, thus providing a natural, tractable optimization target. Indeed, expectations can be estimated using MC estimates from the learned distributions in [eq:ELBO], while the KL-divergence term can typically be computed in closed-form by (1) modeling $q_\phi$ as a Gaussian $q_\phi(z \vert o,a) = \mathcal N\big(\mu_\phi(o,a), \Sigma_\phi(o,a) \big)$ and (2) imposing a standard Gaussian prior on the latent space, $p(z) = \mathcal N(\mathbf{0}, \mathbf{I})$. 
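Under these two choices, the KL term has the well-known closed form $\text{D}_{\text{KL}}\big[\mathcal N(\mu, \text{diag}(\sigma^2)) \Vert \mathcal N(\mathbf 0, \mathbf I)\big] = \frac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)$, which a few lines verify on toy numbers:

```python
import math

def kl_diag_gaussian_vs_standard(mu, sigma2):
    """Closed-form D_KL[N(mu, diag(sigma2)) || N(0, I)] for a diagonal Gaussian posterior."""
    return 0.5 * sum(m * m + s - 1.0 - math.log(s) for m, s in zip(mu, sigma2))

kl_zero = kl_diag_gaussian_vs_standard([0.0, 0.0], [1.0, 1.0])  # posterior equals the prior
kl_pos = kl_diag_gaussian_vs_standard([1.0, -0.5], [0.5, 2.0])  # any mismatch gives KL > 0
```

The divergence vanishes exactly when the posterior matches the prior, and grows as $q_\phi$ moves away from $\mathcal N(\mathbf 0, \mathbf I)$: this is the regularization pressure on the latent space.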
An intuitive explanation of the learning dynamics of VAEs can be given considering the equivalent case of *minimizing the negative ELBO*, which admits a particularly interpretable factorization + $$ -`\min_{\theta, \phi} - \text{ELBO}_{\mathcal (o,a) \sim \mathcal D}(\theta, \phi) = \min_{\theta, \phi}\mathbf{L^{\text{rec}}}(\theta) + \mathbf{L^{\text{reg}}}(\phi)\\ - \mathbf{L^{\text{rec}}}(\theta) = \mathbb{E}_{z \sim q_\phi(\cdot \vert o,a} \big[ \log p_\theta(o,a \vert z) \big]\\ - \mathbf{L^{\text{reg}}}(\phi) = \text{D}_{\text{KL}}\big[ q_\phi(z \vert o,a) \Vert p(z) \big] ` +`\min_{\theta, \phi} - \text{ELBO}_{(o,a) \sim \mathcal D}(\theta, \phi) = \min_{\theta, \phi}\mathbf{L^{\text{rec}}}(\theta) + \mathbf{L^{\text{reg}}}(\phi)\\ \mathbf{L^{\text{rec}}}(\theta) = -\mathbb{E}_{z \sim q_\phi(\cdot \vert o,a)} \big[ \log p_\theta(o,a \vert z) \big]\\ \mathbf{L^{\text{reg}}}(\phi) = \text{D}_{\text{KL}}\big[ q_\phi(z \vert o,a) \Vert p(z) \big] ` $$ + For any given $(o,a)$ pair, the expected value term of [eq:VAE-Lrec] is typically computed via MC estimates, resulting in ``` math -\mathbb{E}_{z \sim q_\phi(\bullet \vert o,a)} \big[ \log p_\theta(o,a \vert z) \big] = \mathbf{L^{\text{rec}}} \approx - \frac{1}{n} \sum_{i=0}^n \log p_\theta(o,a \vert z_i). @@ -805,13 +816,14 @@ Assuming $p_\theta(o,a \vert z)$ is parametrized as an isotropic Gaussian distri ``` Indeed, it is very common in practice to model the learned likelihood $p_\theta(o,a \vert z)$ as a parametric distribution (e.g. Gaussians) parametrized by some learned vector of coefficients derived from $\mu_\theta (z), \ z \sim p (\bullet)$. 
In all such cases, learning a VAE corresponds to optimally *reconstructing* the examples in $\mathcal D$ by minimizing the L2-error--a very common *supervised learning* objective for regression targets--while regularizing the information compression into the latent space, as under the common modeling choice $p(z) = \mathcal N (\mathbf{0}, \mathbf{I})$ [eq:VAE-Lreg] regularizes the posterior, limiting the expressivity of $q_\phi(z\vert o,a)$. -### Diffusion Models +#### Diffusion Models VAEs approximate probability distributions via a *single* latent variable model, assuming the underlying unknown distribution can be factored according to [eq:BC-latent-variable], and solve the variational inference problem of jointly learning the likelihood $p_\theta$ and (approximate) posterior $q_\phi$ for such a model. In that, the unknown data distribution $p(o,a)$ is effectively approximated via $\int_Z p(z) p_\theta(o,a \vert z)$, and the underlying generative process reproduced by (1) sampling a latent variable and (2) learning to decode it into a (ideally) high-likelihood sample under the (unknown) $p(o,a)$. Diffusion Models (DMs) @hoDenoisingDiffusionProbabilistic2020 are another class of GMs which treat the related problem of approximating an underlying unknown data distribution--*variational inference*--by *partially* extending VAEs to the case where *multiple* latent variables influence each other and the generative process underlying $o,a$ itself. 
In particular, DMs posit the generative process can be decomposed into a series of piece-wise (Markovian) interactions between (latent) variables (Figure 24), resulting in + $$ -`p(\underbrace{o,a}_{= z_0}) = \int_{\text{supp}({Z_0})} \int_{\text{supp}({Z_1})} \ldots \int_{\text{supp}({Z_T})} p(z_0, z_1, \dots z_T)\\ - p(z_0, z_1, \dots z_T) = p(z_T) \prod_{t=0}^{T} p(z_{t-1} \vert z_t), ` +`p(\underbrace{o,a}_{= z_0}) = \int_{\text{supp}({Z_0})} \int_{\text{supp}({Z_1})} \ldots \int_{\text{supp}({Z_T})} p(z_0, z_1, \dots z_T)\\ p(z_0, z_1, \dots z_T) = p(z_T) \prod_{t=1}^{T} p(z_{t-1} \vert z_t), ` $$ + where we explicitly showed the marginalization over the multiple latents in [eq:BC-multi-latent-model-1], and used the law of conditional probability and the Markov property in [eq:BC-multi-latent-model-2]. Similarly to VAEs, DMs approximate the process of sampling from the unknown $p(o,a)$ by (1) sampling from an easy-to-sample distribution (e.g., Gaussian) and (2) learning to reconstruct high-likelihood samples under the unknown distribution. However, in stark contrast with VAEs, the easy-to-sample distribution contains *no mutual information* regarding the data distribution $p(o,a)$. Crucially, as no information from the sample $(o,a)$ (denoted as $z_0 \equiv (o,a)$ for the sake of notation) is assumed to be propagated throughout the chain of latents, the posterior $q(z_t \vert z_{t-1})$ assumes a relatively amenable structure in DMs, reducing complexity. The *true* likelihood $p(z_{t-1} \vert z_t)$ is instead typically approximated using the parametrization $p_\theta (z_{t-1} \vert z_t)$. In that, the information contained in the unknown data distribution is *reconstructed* via a process in which samples from a fixed distribution are turned into (ideally) high-likelihood samples under $p(o,a)$--a process referred to as *denoising*. 
Under such a model, we can lower-bound the log-likelihood of an arbitrary sample as[^4]

$$
`\log p_\theta (\underbrace{o,a}_{= z_0}) \geq \mathbb{E}_{z_1 \sim q(\bullet \vert z_0)} \log p_\theta (z_0 \vert z_1) -\\ \mathbb{E}_{z_{T-1} \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_T \vert z_{T-1}) \Vert p(z_T) ) \big] - \notag\\ \sum_{t=1}^{T-1} \mathbb{E}_{(z_{t-1}, z_{t+1}) \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_t \vert z_{t-1}) \Vert p_\theta(z_t \vert z_{t+1}) ) \big], \notag`
$$

providing an optimization target in the form of $\max_\theta \log p_\theta (\mathcal D)$. In their seminal work on using DMs for variational inference, @hoDenoisingDiffusionProbabilistic2020 introduce major contributions regarding solving $\min_\theta -\log p_\theta(o,a)$. In particular, @hoDenoisingDiffusionProbabilistic2020 exclusively adopt a fixed *Gaussian* posterior in the form of $q(z_t \vert z_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}z_{t-1}, \beta_t \mathbf I)$. The choice of adopting Gaussians has profound implications on the generative process modeled.
Indeed, under the (mild) assumption that the variance is sufficiently small, $\beta_t \leq \eta, \ \eta \in \mathbb R^+$, @sohl-dicksteinDeepUnsupervisedLearning2015 proved that the likelihood $p(z_{t-1} \vert z_t)$ is Gaussian as well, which allows for the particularly convenient parametrization of the approximate likelihood $p_\theta (z_{t-1} \vert z_t) = \mathcal N(\mu_\theta(z_t, t), \Sigma_\theta(z_t,t)), \ t \in [1,T]$, as well as for closed-form tractability of the KL-divergence terms in [eq:diffusion-likelihood]. Further, the posterior’s structure also enables an analytical description for the distribution of the $t$-th latent variable, $q(z_t \vert z_0) = \mathcal N (\sqrt{\bar{\alpha}_t}z_0, (1-\bar{\alpha}_t) \mathbf{I})$, with $\alpha_t = 1-\beta_t, \ \bar \alpha_t = \prod_{k=1}^t \alpha_k$, which conveniently avoids iterative posterior sampling.

Finally, adopting Gaussian posteriors permits a particularly pleasing interpretation of the diffusion process.

  caption={'A joint action-observation distribution, in the simplified case where the observation is the elbow-flex actuation in a SO-100, and the action is the recorded position for the same joint in the teleoperator arm. The recorded motion being teleoperated, the points distribute along the diagonal.'}
/>
Because the recorded behavior is teleoperated, measurements mostly distribute along the line $a = o + \eta, \ \eta \sim \mathcal N(0,1)$, with $\eta$-variability accounting for minor control inconsistencies (Figure 26). Using Gaussian posteriors--i.e., adding Gaussian noise--effectively simulates a *Brownian motion* for the elements in the distribution’s support (in Figure 25, $\mathcal O\times \mathcal A$), whereby information *diffuses away* from the samples; comparing the diffused samples to the original data points, one can derive an estimate of the total displacement induced by diffusion. Under the sole assumption that the likelihood of the diffused samples is low under the original unknown data distribution, one can effectively approximate the unknown distribution by learning to *reverse* such displacement.
This key intuition allows one to write a simplified training objective: $ \mathcal L(\theta) = \mathbb{E}_{t, z_0, \epsilon} \big[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} z_0 + \epsilon \sqrt{1 - \bar \alpha_t}, t) \Vert^2 \big], \quad t \sim \mathcal{U}(\{1,\dots,T\}), \quad z_0 \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}).$ In this simplified (minimization) objective, the optimization process differs from [eq:diffusion-likelihood] in that, rather than maximizing $p_\theta$ directly, the parameters $\theta$ of the pairwise likelihood $p_\theta(z_{t-1} \vert z_t)$ are adjusted to *predict the total displacement* $\epsilon$ for a diffusion process of random length ($t \sim \mathcal{U}(\{1,\dots,T\})$) starting from a sample of the target distribution. By learning to predict the total displacement between a sample from the unknown distribution and its corrupted, generally uninformative counterpart obtained by diffusing information away--a displacement which is significant ($\Vert \epsilon \Vert > 0$) whenever the input and target distributions are sufficiently different--@hoDenoisingDiffusionProbabilistic2020 show that one can approximate the underlying distribution by reversing the displacement, i.e., by *denoising* samples. Interestingly, under the hypothesis that real-world data lies on a single lower-dimensional manifold (the Manifold Hypothesis), @permenterInterpretingImprovingDiffusion2024 show that diffusion learns the gradient of a distance function between any off-manifold point (such as perturbed, uninformative samples) and the data manifold itself. Following this gradient--i.e., denoising a sample from an uninformative distribution--corresponds to projecting back onto the manifold, yielding a procedure to sample from unknown distributions by means of Euclidean projection.
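To make the simplified objective in [eq:diffusion-simplified-loss] concrete, here is a minimal NumPy sketch. The linear $\beta$-schedule and the zero-output placeholder regressor are illustrative assumptions; a real $\epsilon_\theta$ would be a neural network trained by gradient descent on this loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule; the schedule is a modeling choice.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def diffuse(z0, t, eps):
    """Sample z_t ~ q(z_t | z_0) in closed form."""
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def simplified_loss(eps_theta, batch):
    """Monte-Carlo estimate of the simplified DDPM objective."""
    losses = []
    for z0 in batch:
        t = rng.integers(0, T)               # t ~ U({1,...,T}), 0-indexed here
        eps = rng.standard_normal(z0.shape)  # eps ~ N(0, I)
        pred = eps_theta(diffuse(z0, t, eps), t)
        losses.append(np.sum((eps - pred) ** 2))
    return float(np.mean(losses))

# Placeholder regressor predicting zero displacement (illustration only).
zero_regressor = lambda z_t, t: np.zeros_like(z_t)
batch = [rng.standard_normal(2) for _ in range(256)]  # toy (o, a) pairs in R^2
loss = simplified_loss(zero_regressor, batch)
```

With the zero regressor, the loss concentrates around the dimensionality of the samples ($\mathbb E[\Vert\epsilon\Vert^2] = 2$ here); training $\epsilon_\theta$ drives it below this baseline.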
Indeed, under the assumption that $p_\theta (z_{t-1} \vert z_t)$ is Gaussian, sampling $z_{t-1} \sim p_\theta(\bullet \vert z_{t})$ corresponds to computing $z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \epsilon_\theta(z_t, t) \right) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal N(\mathbf{0}, \mathbf{I}), $ thus showing that the lower-level latent variables in a DM can be obtained by iteratively removing noise from the one-step-higher variable, using the noise regressor $\epsilon_\theta(z_t, t)$ learned by minimizing [eq:diffusion-simplified-loss].

#### Flow Matching

The posterior parametrization adopted by DMs proved effective, yet it raised concerns about its efficiency at inference time, where a possibly large number of compute-expensive denoising steps is needed in order to recover a sample from the target distribution. Flow Matching (FM) @lipmanFlowMatchingGenerative2023 extends DMs to the general case of arbitrary, parametrized likelihoods and posteriors, and in doing so defines a superseding class of GMs providing a unified framework for learning *continuous transformations* between distributions, encompassing and generalizing DMs. Instead of a *stochastic, discrete, multi-step* denoising process, FM aims to learn a *deterministic, continuous, differentiable flow* $\psi: [0,1] \times Z \mapsto Z$, formalized starting from a possibly time-dependent vector field $v: [0,1] \times Z \mapsto Z$ transporting samples from a simple prior distribution $p_0$--e.g., a standard Gaussian--to a more complex, potentially unknown data distribution $p_1$ over time. Note how FM models time $t \in [0,1]$ as varying continuously while moving away *from* an easy-to-sample distribution $p_0$ *towards* the unknown data distribution, $p_1$. This results in a continuous and deterministic trajectory for each sample, which can be more efficient to generate compared to the stochastic paths of DMs.
Formally, FM can be fully characterized by an ordinary differential equation (ODE) relating instantaneous variations of the flow to the underlying vector field, hence providing complete trajectories over the distributions’ support when integrated over time,

$$
`\frac{d}{dt} \psi(t, z) = v(t, \psi(t, z))\\ \psi(0, z) = z`
$$

FM proved very effective in a variety of applications, ranging from image @esserScalingRectifiedFlow2024 and video generation @polyakMovieGenCast2025 to robotics control @blackp0VisionLanguageActionFlow2024. Most notably, in their introductory work on FM for generative modeling, @lipmanFlowMatchingGenerative2023 show how DMs can be seen as a specific instance of FM where the *conditional* target vector field $u$ approximated by the noise regressor corresponds to the expression in [eq:fm-diffusion-vector-field], with the noising schedule of DMs resulting in a stochastic process that resembles Brownian motion.

In practice, FM can be applied to generative modeling by learning a vector field regressor $v_\theta(z, t)$ to approximate a given target vector field $u(t, z)$. In the particular case of DMs, $u(t, z)$ is defined as in [eq:fm-diffusion-vector-field], while in principle the target vector field can be learned to induce a particular transportation, or fixed according to OT. Given a sample from the data distribution $z_1 \sim p_1$ and a sample from an easy-to-sample prior $z_0 \sim p_0$, Conditional Flow Matching (CFM) defines a simple path between them using *linear interpolation* between samples, $z_t = (1-t)z_0 + t z_1$, resulting in the target vector field $u(t, z_t) = z_1 - z_0$. Then, a FM model can be trained with the simple regression objective defined as $ \mathcal L(\theta) = \mathbb{E}_{t, z_0, z_1} \big[ \Vert v_\theta((1-t)z_0 + t z_1, t) - (z_1 - z_0) \Vert^2 \big], \quad t \sim \mathcal{U}([0,1]),$ where $z_0 \sim p_0(\bullet)$ and $z_1 \sim p_1(\bullet)$.
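The CFM recipe above can be sketched numerically. As a toy assumption purely for illustration, take $p_1$ to be a point mass at $c$: the marginal vector field is then known in closed form, $v(z,t) = (c - z)/(1-t)$, so one can check both that the CFM loss vanishes for the exact field and that Euler integration of the ODE transports prior samples onto the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, z0_batch, z1_batch):
    """Monte-Carlo estimate of the conditional flow matching objective."""
    losses = []
    for z0, z1 in zip(z0_batch, z1_batch):
        t = rng.uniform()                 # t ~ U([0, 1]): continuous time
        z_t = (1 - t) * z0 + t * z1       # linear interpolation path
        target = z1 - z0                  # conditional target vector field
        losses.append(np.sum((v_theta(z_t, t) - target) ** 2))
    return float(np.mean(losses))

# Toy target: a point mass at c, whose marginal field is known in closed form.
c = np.array([3.0, -1.0])
v_exact = lambda z, t: (c - z) / (1.0 - t)

def sample(v, z0, n_steps=50):
    """Generate a sample by forward-Euler integration of dz/dt = v(z, t)."""
    z, dt = np.array(z0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * v(z, k * dt)
    return z

z0_batch = [rng.standard_normal(2) for _ in range(32)]
loss = cfm_loss(v_exact, z0_batch, [c] * 32)  # ≈ 0 for the exact field
z1 = sample(v_exact, rng.standard_normal(2))  # transported onto c
```

A learned $v_\theta$ simply replaces `v_exact`; the same `sample` routine (or any off-the-shelf ODE solver) then implements inference.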
Note how in [eq:flow-matching-objective]--differently from [eq:diffusion-simplified-loss]--time is assumed to vary continuously, $t \sim \mathcal U([0,1])$, rather than discretely, $t \sim \mathcal U(\{1,\dots,T\})$: a key property of flow-based models. The objective in [eq:flow-matching-objective] directly regresses the learned vector field onto the simple, straight path connecting a point from the prior and a point from the data, providing a simulation-free training procedure that is both stable and efficient. At inference time, samples are generated by starting from $z_0 \sim p_0$ and iteratively refining according to $\frac{dz}{dt} = v_\theta(z_t, t)$ for $t \in [0,1]$--an operation that can be numerically carried out with standard ODE solvers.

### Action Chunking with Transformers

While GMs prove useful in learning complex, high-dimensional multi-modal distributions, they do not natively address the compounding-errors problem characteristic of online, sequential predictions. In Action Chunking with Transformers (ACT), @zhaoLearningFineGrainedBimanual2023 present an application of VAEs to the problem of learning purely from offline trajectories, introducing a simple yet effective method to mitigate error compounding while learning high-fidelity autonomous behaviors. Drawing inspiration from how humans plan to atomically enact sequences of the kind $a_{t:t+k}$ instead of single actions $a_t$, @zhaoLearningFineGrainedBimanual2023 propose learning a GM on a dataset of input demonstrations by modeling *action chunks*. Besides contributions to learning high-performance autonomous behaviors, @zhaoLearningFineGrainedBimanual2023 also introduce hardware contributions in the form of a low-cost bimanual robot setup (ALOHA) capable of performing fine-grained manipulation tasks, such as opening a lid, slotting a battery into its allotment, or even preparing tape for application.
On the robot learning side of their contributions, @zhaoLearningFineGrainedBimanual2023 adopt transformers as the architectural backbone to learn a *Conditional* VAE @sohnLearningStructuredOutput2015. Conditional VAEs are a variation of the more standard VAE formulation introducing a conditioning variable on sampling from the latent prior, allowing the modeling of *one-to-many* relationships between latent and data samples. Further, in stark contrast with previous work @florenceImplicitBehavioralCloning2022, @jannerPlanningDiffusionFlexible2022, @zhaoLearningFineGrainedBimanual2023 do not learn a full joint $p_\theta(o,a)$ over observations and actions. While the *policy* distribution $p_\theta(a \vert o)$ can in principle be entirely described from its joint $p_\theta(o,a)$, the conditional distribution is often intractable when using function approximators, as $p_\theta(a \vert o) = \tfrac{p_\theta(o,a)}{\int_\mathcal A p_\theta(o,a) \, da}$ and the integral in the denominator is typically intractable. Instead of modeling the full joint using a vanilla VAE, @zhaoLearningFineGrainedBimanual2023 propose learning a *conditional* VAE @sohnLearningStructuredOutput2015 modeling the policy distribution $p (a \vert o)$ directly.
In practice, when learning from demonstrations, adopting CVAEs results in a slight modification of the VAE objective in [eq:ELBO], which is adapted to $ \text{ELBO}_{\mathcal D}(\theta, \phi, \omega) = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim q_\phi(\cdot \vert o_i, a_i)} \big[ \log p_\theta(a_i \vert z, o_i) \big] - \text{D}_{\text{KL}}\big[ q_\phi(z \vert o_i, a_i) \Vert p_\omega(z \vert o_i) \big] \right)$ Notice how in [eq:c-ELBO] we are now also learning a new set of parameters $\omega$ for the prior distribution in the latent space. Effectively, this enables conditioning latent-space sampling (and thus reconstruction) during training, and potentially at inference, proving useful when learning inherently conditional distributions like policies. Further, ACT is trained as a $\beta$-CVAE @higgins2017beta, using the weight of the KL regularization term in [eq:c-ELBO] as a hyperparameter regulating the information condensed in the latent space, where a higher $\beta$ results in a less expressive latent space.

In their work, @zhaoLearningFineGrainedBimanual2023 ablated using a GM to learn from human demonstrations against a simpler, supervised objective, $\mathcal L_1(a,a^\prime) = \Vert a - a^\prime \Vert_1$. Interestingly, they found the performance of these two approaches to be comparable when learning from *scripted* demonstrations.
That is, when learning from data collected by rolling out a predetermined set of commands $[q^c_0, q^c_1, \dots]$, GM did *not* prove competitive compared to standard supervised learning. However, when learning from human demonstrations--i.e., from data collected executing commands coming from a human controller $[q^h_0, q^h_1, \dots]$--they found performance (success rate on a downstream task) to be severely hindered (-33.3%) when adopting a standard supervised learning objective compared to a richer, potentially harder-to-learn variational objective, in keeping with the multimodal nature of human demonstration data and the findings presented in @florenceImplicitBehavioralCloning2022. The authors also ablate the action chunking paradigm, reporting significant performance gains when performing action chunking (1% vs. 44% success rate). To avoid acting open-loop, @zhaoLearningFineGrainedBimanual2023 design an inference process consisting of performing inference at every timestep $t$ and then aggregating overlapping chunks via an exponential moving average.
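The aggregation step can be sketched as follows (our own minimal interpretation, not the reference implementation): at each timestep, the overlapping predictions made for that timestep by past chunks are combined with exponential weights $w_i \propto \exp(-m \cdot i)$, where $i$ indexes predictions from oldest to newest and $m$ is a tunable decay.

```python
import numpy as np

def ensemble_action(predictions, m=0.1):
    """Aggregate overlapping chunk predictions for a single timestep.

    predictions: (n_preds, action_dim) array-like, ordered oldest to newest;
    returns the exponentially weighted average, with weights w_i ∝ exp(-m * i).
    """
    preds = np.asarray(predictions, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))  # i = 0 is the oldest prediction
    return (w / w.sum()) @ preds

# Three overlapping chunks predicted slightly different targets for step t:
preds = [[1.00, 0.50], [1.10, 0.55], [0.90, 0.45]]
a_t = ensemble_action(preds, m=0.1)  # smoothed command executed at step t
```

With $m=0$ this reduces to a plain average; under this sign convention, larger $m$ weighs older predictions more, smoothing the executed trajectory.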
Then, alongside embeddings for the available proprioceptive information and the style variable z retrieved from the CVAE encoder, the Transformer encoder shares the matrices K, V with the Transformer decoder, which is trained to decode fixed position embeddings into valid action chunks.'}
/>

#### Code Example: Learning ACT

### Diffusion Policy

DMs proved very effective in approximating complex, high-dimensional distributions, such as distributions over images @hoDenoisingDiffusionProbabilistic2020 or videos @polyakMovieGenCast2025, thanks to their inherent capability to deal with multimodal data and their training stability. In Diffusion Policy (DP), @chiDiffusionPolicyVisuomotor2024 present an application of DMs to the field of robot learning, leveraging diffusion to model human expert demonstrations in a variety of simulated and real-world tasks. Similarly to Action Chunking with Transformers @zhaoLearningFineGrainedBimanual2023, @chiDiffusionPolicyVisuomotor2024 (1) adopt a modified *observation-conditioned target distribution* instead of the full joint $p(o,a)$ and (2) predict multiple actions into the future instead of a single action. Besides the intractability of the observations’ marginal $p_\theta(o)$ given $p_\theta(o,a)$, DP’s rationale for modeling the data distribution via $p_\theta(a \vert o)$ stems from the rather compute-intensive nature of diffusion at test time, whereby generating actions *alongside* observations is likely to result in higher complexity and thus a larger number of denoising operations--ultimately pointless, considering robotics applications rely on the capability to generate controls rather than to reproduce observations.
In practice, conditioning on observation data is achieved by conditioning the noise regressor $\epsilon_\theta$ introduced in [eq:diffusion-simplified-loss] on a stack of $T_o$ observations, resulting in the *conditional* simplified diffusion objective

$$
`\mathcal L(\theta) = \mathbb{E}_{t, a_{t:t+T_a}, \epsilon} \big[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} a_{t:t+T_a} + \epsilon \sqrt{1 - \bar \alpha_t}, t, o_{t-T_o:t}) \Vert^2 \big],\\ t \sim \mathcal{U}(\{1,\dots,T\}), \quad a_{t:t+T_a}, o_{t-T_o:t} \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}). \notag`
$$

Notice how in [eq:diffusion-policy-objective] the noise regressor is conditioned both on the latent variable rank $t$ *and* on a stack of previous observations $o_{t-T_o:t}$. @chiDiffusionPolicyVisuomotor2024 claim the combination of (1) conditioning on a horizon of previous observations and (2) predicting multiple actions into the future allows DP to *commit to specific modes* in the data at inference time, which proves essential for good performance and for avoiding indecisiveness. CNN-based backbones (Figure 32) are however reported to be biased towards learning low-frequency components @tancikFourierFeaturesLet2020, and may thus prove more challenging to train on non-smooth action sequences.
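Mirroring the earlier unconditional sketch, the observation-conditioned objective only changes the regressor's signature: noise is added to the *action chunk*, and the regressor also receives the observation stack. The toy shapes and the zero-output placeholder below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, T_a, T_o, act_dim, obs_dim = 100, 16, 2, 7, 14  # assumed toy sizes
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def dp_loss_sample(eps_theta, action_chunk, obs_stack):
    """Single-sample estimate of the conditional diffusion objective.

    action_chunk: (T_a, act_dim) array for a_{t:t+T_a};
    obs_stack:    (T_o, obs_dim) array for o_{t-T_o:t}."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(action_chunk.shape)
    noisy = np.sqrt(alpha_bars[t]) * action_chunk \
        + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.sum((eps - eps_theta(noisy, t, obs_stack)) ** 2))

# Placeholder conditioned regressor (a real one is a CNN/Transformer network).
eps_hat = lambda noisy, t, obs: np.zeros_like(noisy)
loss = dp_loss_sample(eps_hat,
                      rng.standard_normal((T_a, act_dim)),
                      rng.standard_normal((T_o, obs_dim)))
```

Only the regressor sees the observations; the noising and the loss are identical to the unconditional case.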
#### Code Example: Learning Diffusion Policies

### Optimized Inference

Modern visuomotor policies output *action chunks*--sequences $\pi(o_t) = \mathbf{A}_t$, with $\mathbf{A}_t = \bigl(a_t,a_{t+1},\dots,a_{t+H_a}\bigr)$ being a sequence of $H_a \gg 1$ low-level commands enqueued in an action queue, originating from an environment observation $o_t$. Predicting series of actions instead of single commands proved essential in learning complex, multi-modal behavior @zhaoLearningFineGrainedBimanual2023, @chiDiffusionPolicyVisuomotor2024.

We directly assess the lack of adaptiveness of robot systems due to acting open-loop.

  alt="Figure"
/>
Asynchronous inference. Illustration of the asynchronous inference stack. Note that the policy can be run on a remote server, possibly with GPUs.
##### Implementation details

*Async* inference (1) tightens the control loop by capturing observations more often, directly eliminating idle gaps at runtime, and (2) allows running inference on more powerful computational resources than those typically available onboard autonomous robotic platforms.

Interestingly, the behavior of async inference can be studied analytically.

  alt="Figure"
/>
Action queue size evolution at runtime for various levels of g, when (A) not filtering out observations based on joint-space similarity and (B) filtering out near-duplicate observations, measuring their similarity in joint-space.
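The queue dynamics above can be reproduced with a toy model (our own sketch; $H_a$, the latency, and the trigger rule are illustrative assumptions): a new chunk is requested once the queue drains to $g \cdot H_a$ actions, and the fresh chunk arrives a fixed number of control steps later.

```python
def send_period(H_a, dt, g, expected_latency):
    """Average time between chunk receptions: (1 - g) * H_a * dt + E[latency]."""
    return (1.0 - g) * H_a * dt + expected_latency

def simulate(H_a=50, g=0.7, latency_steps=5, horizon=200):
    """Track queue size per control step: request a new chunk when the queue
    drains to g * H_a; it arrives latency_steps later and refills the queue."""
    queue, pending, sizes = H_a, None, []
    for step in range(horizon):
        sizes.append(queue)
        if pending is None and queue <= g * H_a:
            pending = step + latency_steps   # fire an inference request
        if pending == step:
            queue, pending = H_a, None       # fresh chunk replaces the queue
        queue = max(queue - 1, 0)            # consume one action per step
    return sizes

sizes = simulate()
```

With these numbers the queue oscillates between $H_a$ and roughly $g \cdot H_a$ minus the latency, never idling; a larger latency or smaller $g$ would instead let it hit zero, stalling the robot.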
Figure 34 emphasizes the trade-off governed by $g$: small values result in idle periods, whereas $g\approx 1$ assumes a highly accurate model and pays a significant compute price. In practice, choosing $g\in(0,1)$ allows striking a balance between reactivity and resource budgets. If not for the aforementioned similarity filter, the system would send observations for processing every $(1 - g) H_a \cdot \Delta t$ seconds, receiving a new chunk of actions every $(1 - g) H_a \cdot \Delta t + \mathbb E[\ell_S]$ seconds, on average. The presence of the observation similarity filter dilates this processing time, and serves the purpose of avoiding the robot stalling due to the queue being constantly integrated with an incoming, nearly identical action chunk. In particular, Figure 34 shows a queue which is filled with incoming actions *unless* near-duplicate observations are filtered out from the processing pipeline. For clarity, the red arrow in Figure 34 highlights a timestep where the observation similarity mechanism is bypassed, forcing a (nearly identical) observation to be processed as the queue runs empty.

#### Code Example: Using Async Inference

## Generalist Robot Policies
The advent of large models trained on internet-scale datasets has drastically influenced the field of robotics.

  caption={'Fields within ML such as Computer Vision and NLP converged on the development of foundation models, trained on a variety of large-scale datasets and capable of performing multiple downstream tasks (top). Conversely, robotics suffered from limited standardization in terms of the architectures used, and siloed, task-specific datasets, incurring a high degree of fragmentation which traditionally hindered the development of generalist models for robotics in favour of task-specific models (bottom).'}
/>

### Preliminaries: Models and Data

The remarkable success of foundation models in NLP and CV is predicated on two core principles: architectural innovation and joint data-compute scaling. The transformer architecture proved instrumental in capturing long-range dependencies in sequential data such as text, and its stability and expressivity made it the *de facto* standard for modern large-scale models trained on internet-scale amounts of data. In stark contrast with popular NLP @raffelExploringLimitsTransfer2023 and CV @ImageNet_VSS09 general-purpose datasets, the field of robotics has historically developed around task-specific datasets, which hinders scalability across problems, resulting in a concrete data deficit for general-purpose robot learning. Unlike the wealth of relatively readily available text and images on the internet, robotics data is intrinsically embodied--datasets collected for a manipulation robot typically differ entirely from locomotion datasets. Further, datasets consisting of expert demonstrations are (1) intrinsically expensive to collect and (2) notoriously heterogeneous--different human experts may perform the same task optimally yet in very different ways.
In particular, since each expert trajectory is tied to a specific robot platform and the operating conditions of its environment and task, data heterogeneity has long posed a *methodological* challenge for scaling robotics datasets via aggregation. Beyond this, heterogeneity also raises *conceptual* issues: naively mixing data across embodiments can induce negative transfer, as control strategies developed in isolation for different robot systems in different environments may even conflict when combined. Thus, the high degree of fragmentation of robotics datasets and tasks has traditionally led to the development of *specialist* policies, trained on small, task-specific datasets, which excel at their designated task but fail to generalize to new situations (Figure 35).

The success of large, proprietary models like RT-1 and RT-2 highlighted a growing interest in openly available generalist models and datasets. Figure 37 illustrates graphically the two most relevant trends in modern robot learning. As datasets of increasing size collected via centralized, cross-institution cooperation are made available to the research community, decentralized datasets collected by individual researchers and practitioners have also gained traction recently, closing the gap with academic benchmarks thanks to community-contributed datasets. Further, models used across tasks and embodiments are also becoming much more compute-efficient, and as a result models’ size has been consistently reducing over time, with consequent gains for autonomous robots in real-world, resource-constrained environments.

### Modern VLAs

Modern recipes to train large-scale VLAs extend early efforts to learn foundation models from large amounts of data via BC, introducing significant advancements concerning both architectural and procedural aspects.
From an architectural perspective, modern VLAs such as $\pi_0$ @blackp0VisionLanguageActionFlow2024 leverage a *unified transformer model* for computational efficiency, while maintaining specialized sub-components within the model for visual perception and action prediction, enabling cross-task performance via language conditioning. Crucially, modern VLAs including @blackp0VisionLanguageActionFlow2024\[$\pi_0$\] and @shukorSmolVLAVisionLanguageActionModel2025\[SmolVLA\] adopt *unified* transformer models employing disjoint sets of weights (*experts*) for compute-efficient visual-semantic understanding and robotic control. Procedurally, modern VLAs complement advanced Vision-Language Model (VLM) backbones with action-specific modules, (1) adopting mid-sized *action experts* to model continuous action distributions $p (a_{t:t+H_a} \vert o_t)$--avoiding discrete action tokens entirely--and (2) relying on *action chunking* as a strategy to reduce error compounding when predicting multiple actions while learning from inherently non-i.i.d. data, such as demonstration data. These architectural and procedural innovations present three benefits. First, developing architectures that exploit internet-scale pre-trained backbones fully capitalizes on the vast world knowledge and skills state-of-the-art VLMs exhibit, preventing models from needing to learn visual, linguistic and semantic concepts from scratch. Second, using generative models for continuous action distributions makes it possible to learn rich, multimodal data distributions, a much more likely scenario in the big-data regime typically tackled while developing generalist policies. Third, introducing two separate components for perception and action planning could enable using Mixture of Experts (MoE) architectures @fedusReviewSparseExpert2022, more efficient to run and thus resulting in faster inference--a key feature for models deployed in real-world scenarios.
This new paradigm has been at the core of some of the most capable generalist policies developed to date, capable of few-shot adapting to novel tasks and of performing highly dexterous manipulation tasks, ranging from folding laundry end-to-end to bussing tables.

#### VLMs for VLAs

VLMs are designed to process both visual and textual modalities--most commonly by taking both images and text as input and generating text conditioned on the visual context. Recent advances in VLMs have been driven by the success of LLMs, with many approaches building upon pretrained LLMs and adopting training paradigms similar to the ones used in language modeling. Typically, VLMs @alayracFlamingoVisualLanguage2022, @laurenconWhatMattersWhen2024, @linVILAPretrainingVisual2024 are constructed by integrating a pretrained vision encoder @radfordLearningTransferableVisual2021, @zhaiSigmoidLossLanguage2023, @finiMultimodalAutoregressivePretraining2024 with a pretrained LLM @grattafioriLlama3Herd2024, @jiangMistral7B2023. Training then proceeds in multiple multimodal stages, beginning with large-scale pretraining on datasets containing image-text pairs @LAION-COCO, @kakaobrain2022coyo700m and interleaved vision-language corpora @OBELICS, @MMC4, followed by a supervised fine-tuning stage on instruction-tuning datasets @LLaVA-1.5, @tong2024cambrian, @laurenconWhatMattersWhen2024. The inherent multimodal nature of VLMs enables them to jointly reason over vision and language. Pre-training on vast internet-scale datasets allows these models to associate visual patterns with textual descriptions, thereby acquiring a rich semantic understanding of the world--knowledge about objects, their properties, and relationships--without explicit supervision for each concept. In turn, integrating a VLM as a perception backbone for a VLA allows the complete model to inherit rich world knowledge, sidestepping the need to learn visual and semantic representations from scratch.
In principle, this allows the robot to ground high-level natural language instructions in its visual context, and possibly to recognize unseen objects by connecting them to concepts absorbed during pre-training, improving its ability to generalize to novel scenarios. Recently, compute efficiency has also become a central focus in VLM research. Several works aim to reduce training costs by using smaller, more diverse datasets @LLaVA-1.5, @InstructBLIP, @bai2025qwen25vl, @zhu2024minigpt, @tong2024cambrian, training smaller-scale models @marafiotiSmolVLMRedefiningSmall2025, @moondream, @minicmpv2024, or adapting pretrained unimodal models by tuning only a small subset of parameters @shukor2023epalm, @vallaeys2024improveddepalm, @MAPL, @FROMAGe, @tsimpoukelli2021multimodalfrozen, @BLIP-2. While the majority of VLM research focuses on image and text modalities, recent work has demonstrated that similar techniques can be extended to integrate additional modalities, such as video and audio @wang2025internvideo2, @liu2024kangaroo, @zhang2025videollama, @kong2024audioflam--a particularly promising direction of research for robotics applications, where multiple sensor modalities can be integrated effectively. This trend towards efficiency is paramount for robotics applications, where policies must operate under the stringent constraints of real-world deployment. Indeed, robots often possess limited on-board computational resources and must react in real time to dynamic environments. Smaller and faster VLMs have thus become essential for developing responsive autonomous systems, enabling high-frequency control loops by reducing the latency between perception and action.
### $\pi_0$

$\pi_0$ @blackp0VisionLanguageActionFlow2024 is a VLA built on a MoE architecture consisting of (1) a pre-trained VLM backbone (Gemma 2.6B @teamGemma2Improving2024) and (2) a dedicated action expert used to generate continuous actions via flow matching. Images and language are embedded with a late-fusion VLM (PaliGemma), while the proprioceptive state and action chunks are routed to a smaller action expert, initialized from scratch. The two experts communicate via self-attention layers, but maintain disjoint weights to obtain the query, key and value matrices at each layer, preserving specialization while efficiently allocating computation.

  caption={'The π0 architecture, as in @blackp0VisionLanguageActionFlow2024. Vision and language tokens are routed to a VLM backbone which is prevented from attending to robot proprioceptive states and action tokens, which are instead routed to a smaller subset of weights within the architecture. The architecture is trained with Flow Matching on 10M+ trajectories from a mixture of closed and openly available datasets.'} />
Concretely, $\pi_0$ is a unified transformer with two disjoint sets of weights $\phi, \theta$.
A larger VLM backbone $p_\phi$ initialized from Gemma 2.6B processes multiple image frames obtained from multiple camera viewpoints $[\{ I_t \}_{t=1}^n]$, as well as a language instruction $[\ell_t]$ describing the task at hand. Concurrently, a 300M-parameter *action expert* based on a similar transformer architecture processes the robot proprioceptive state $q_t$ and an action chunk $a_{t:t+H_a}$ (Figure 38). The two expert networks process their respective inputs separately, each producing its own query, key and value matrices, and share information with each other only via self-attention layers. The outputs from the VLM backbone are disregarded, while the vector field regressed by the action expert is used to iteratively refine the action chunk. In particular, $\pi_0$ uses a *blockwise causal attention mask* over tokens belonging to three separate blocks: (1) image and language tokens $\mathcal T_i$ obtained from $[\{ I_t \}_{t=1}^n, \ell_t]$, (2) proprioceptive tokens $\mathcal T_q$ obtained from $q_t$, and (3) the action tokens $\mathcal T_a$ for items in the chunk $a^{\tau}_{t:t+H_a}$ at time $\tau$ in the flow-matching process. Notably, *within* each block the attention operations are bidirectional, while across blocks, future blocks are masked out. Formally, this corresponds to using the attention mask

$$
\mathbf{A} =
\bordermatrix{
    & \mathcal{T}_i & \mathcal{T}_q & \mathcal{T}_a \cr
  \mathcal{T}_i & \mathbf{1} & \mathbf{0} & \mathbf{0} \cr
  \mathcal{T}_q & \mathbf{1} & \mathbf{1} & \mathbf{0} \cr
  \mathcal{T}_a & \mathbf{1} & \mathbf{1} & \mathbf{1} \cr
}, \quad \mathbf{1}: \text{Bidirectional Attention}, \ \mathbf{0}: \text{Masked Attention}
$$

Note how *intra*-block bidirectional attention allows tokens within a block to communicate freely, while *inter*-block communication is mediated by the attention mask $\mathbf{A}$.
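The mask $\mathbf{A}$ can be materialized programmatically; a minimal sketch (NumPy, toy block sizes of our choosing), assuming tokens are ordered as $[\mathcal{T}_i, \mathcal{T}_q, \mathcal{T}_a]$:

```python
import numpy as np

def blockwise_causal_mask(n_img_lang, n_proprio, n_action):
    """Build A: bidirectional attention within each block, causal across
    blocks in the order [T_i, T_q, T_a]; 1 = attend, 0 = masked."""
    sizes = [n_img_lang, n_proprio, n_action]
    starts = np.cumsum([0] + sizes)
    mask = np.zeros((starts[-1], starts[-1]), dtype=int)
    for q in range(3):             # query block index
        for k in range(q + 1):     # key blocks: itself and all earlier blocks
            mask[starts[q]:starts[q + 1], starts[k]:starts[k + 1]] = 1
    return mask

A = blockwise_causal_mask(n_img_lang=4, n_proprio=1, n_action=3)
print(A)
# Image-language queries (rows 0-3) see only T_i; action queries (rows 5-7) see everything.
```

Rows index queries and columns index keys, so the zero upper-right blocks are exactly the masked entries of $\mathbf{A}$ above.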
*Blockwise causal masking* effectively prevents the pre-trained perception-language tokens from attending to robotics tokens, which are likely out of distribution for VLM backbones traditionally trained on large corpora of internet, non-robotics data. Crucially, because the image-language and proprioceptive tokens never attend to the action tokens, their keys and values do not change across denoising steps and can be cached at runtime, reducing the computational footprint and speeding up inference. In $\pi_0$, both the VLM backbone and the action expert are updated using a *flow matching* loss, minimizing:

$$
\mathcal{L}(\phi, \theta) = \mathbb{E}_{\tau, \epsilon, o_t, a_{t:t+H_a}}\Big[ \big\Vert v_\theta(\underbrace{\tau a_{t:t+H_a} + (1-\tau) \epsilon}_{\tilde a_{t:t+H_a}},\, o_t,\, \tau) - (\epsilon - a_{t:t+H_a}) \big\Vert^2 \Big],\\ \tau \sim \mathrm{Beta}_{[0,s]}(1.5,1), \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad o_t, a_{t:t+H_a} \sim \mathcal D \notag
$$

where the experts, parametrized by the separate weights $\phi, \theta$, interact with each other via self-attention layers only, so that the action expert $v_\theta$'s internal computations also depend on the VLM backbone's parameters $\phi$. Importantly, @blackp0VisionLanguageActionFlow2024 minimize this loss over both the multimodal backbone and action expert parameters, thus updating the internal representations of the VLM using BC-specific gradients. In contrast, @driessKnowledgeInsulatingVisionLanguageAction2025 later show that failing to insulate the VLM knowledge from the flow matching gradients actually harms performance. Inference is performed by iteratively refining action chunks while numerically forward-integrating the vector field predicted by the action expert,

```math
\begin{equation}
a_{t:t+H_a}^{\tau + \delta} = a_{t:t+H_a}^{\tau} + \delta\, v_\theta(a_{t:t+H_a}^{\tau}, o_t)
\end{equation}
```

Lastly, @blackp0VisionLanguageActionFlow2024 present cross-embodiment experiments where they demonstrate $\pi_0$'s ability to control both mobile and static manipulator robots with varying arm embodiments.
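The loss and the Euler update above can be sketched end to end; `v_theta` below is a random linear stand-in for the action expert with the observation conditioning elided, so this illustrates only the shapes, the noising scheme, and the sampling loop, not a trained policy:

```python
import numpy as np

rng = np.random.default_rng(0)
H_A, D_ACT = 50, 6                                          # toy chunk horizon, action dim
W = rng.standard_normal((H_A * D_ACT, H_A * D_ACT)) * 0.01  # stand-in expert weights

def v_theta(noisy_chunk, o_t=None, tau=None):
    """Random linear stand-in for the action expert's vector field."""
    return (noisy_chunk.reshape(-1) @ W).reshape(H_A, D_ACT)

def flow_matching_loss(action_chunk, o_t=None, s=0.999):
    """One Monte Carlo sample of the loss above: noise the chunk at a random
    tau ~ Beta_[0,s](1.5, 1) and regress v_theta towards (eps - a)."""
    tau = s * rng.beta(1.5, 1.0)                  # Beta rescaled to [0, s]
    eps = rng.standard_normal(action_chunk.shape)
    noisy = tau * action_chunk + (1 - tau) * eps  # a~ = tau*a + (1-tau)*eps
    target = eps - action_chunk
    return np.mean((v_theta(noisy, o_t, tau) - target) ** 2)

def sample_chunk(o_t=None, n_steps=10):
    """Start from Gaussian noise and apply the Euler update
    a <- a + delta * v_theta(a, o_t) for n_steps steps (pi_0 uses 10)."""
    a = rng.standard_normal((H_A, D_ACT))
    delta = 1.0 / n_steps
    for _ in range(n_steps):
        a = a + delta * v_theta(a, o_t)
    return a

loss = flow_matching_loss(rng.standard_normal((H_A, D_ACT)))
chunk = sample_chunk()
print(loss >= 0, chunk.shape)
```

In the real model, gradients of this loss flow through both $\theta$ and (via self-attention) $\phi$; the stand-in here has a single weight matrix and no VLM.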
The emergence of cross-embodiment capabilities is largely attributable to the presence of large-scale cross-embodiment data in the data mixture, which $\pi_0$ handles by defaulting to the maximal configuration size across the $\pi$ dataset and zero-padding robots with fewer DoFs. That is, $\pi_0$ always processes an 18-DoF configuration (two 6-DoF arms, two grippers, a base, and a vertical torso) regardless of the kind of robot, and robots with fewer DoFs are zero-padded. $\pi_0$ also relies on three camera views, and uses masked image slots for training and deployment scenarios with fewer cameras.

#### Code Example: Using $\pi_0$

### SmolVLA

VLAs remain in an early stage of development and are not yet as mature or widely adopted as LLMs and VLMs. Further, much of the impactful VLA progress remains proprietary, with many models sharing only weights while withholding full training details and essential methodological components. SmolVLA @shukorSmolVLAVisionLanguageActionModel2025 is an entirely open-source research effort, aiming to democratize the development of robotics foundation models by open-sourcing the model, training recipes and data used.

  caption={'The SmolVLA architecture, as in @shukorSmolVLAVisionLanguageActionModel2025. SmolVLA is a compact MoE model trained with flow matching to denoise action chunks. Vision and language tokens are fed to a VLM backbone, and share information with the proprioceptive and action tokens via the attention mechanism. The action expert interleaves SA and CA layers for further conditioning on the visual features from the VLM backbone.
SmolVLA skips computations and reduces the visual tokens, resulting in 6x less memory usage than π0.'} />

While encouraging efforts like $\pi_0$ @blackp0VisionLanguageActionFlow2024 demonstrate the feasibility of open VLA systems, they remain (1) large and compute-intensive and (2) dependent on closed datasets collected via centralized efforts on costly robotic platforms, ultimately hindering accessibility.
SmolVLA mitigates both these accessibility issues by (1) prioritizing a compact, compute-efficient VLA design and (2) targeting community-contributed datasets on accessible robotic platforms such as the SO-100 and SO-101 arms. Similarly to $\pi_0$, SmolVLA (Figure 39) employs a MoE architecture combining a pretrained VLM backbone with a dedicated action expert, and trains with flow matching. To ensure efficiency and accessibility, SmolVLA adopts SmolVLM-2 @marafiotiSmolVLMRedefiningSmall2025 as its VLM backbone, given SmolVLM-2's reduced size and its ability to process multiple image inputs alongside text. SmolVLM-2 uses SigLIP @zhaiSigmoidLossLanguage2023 as its vision encoder, producing visual features for a SmolLM2 language decoder @allalSmolLM2WhenSmol2025. Further, SmolVLA adopts a smaller action expert consisting of $\sim$100M parameters and an interleaved stack of self- and cross-attention layers. To improve efficiency, the action expert uses a reduced embedding dimension compared to the VLM backbone, with $d_{v_\theta} = 0.75 d_{\text{VLM}}$. @shukorSmolVLAVisionLanguageActionModel2025's design choices thus result in a much smaller model than $\pi_0$: around 450M parameters versus $\pi_0$'s 3.3B. Effectively, SmolVLA consumes multi-view RGB images, a natural-language instruction, and a projected sensorimotor state token as inputs, together with the noised *action chunk* $\tilde{a}_{t:t+H_a}$ that the action expert $v_\theta$ is trained to denoise. In particular, robot proprioceptive states are first projected into a shared token space with the VLM to match $d_{\text{VLM}}$, and subsequently projected into the expert's token space. Similarly to $\pi_0$, SmolVLA adopts separate experts communicating exclusively through self-attention layers, which however forgo blockwise causal masking in favour of simple causal masking, resulting in a lower-triangular attention mask.
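The two-step state projection and the reduced expert width can be sketched as follows; the dimensions are toy values chosen for illustration, and only the ratio $d_{v_\theta} = 0.75\, d_{\text{VLM}}$ is taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

D_VLM = 64
D_EXPERT = int(0.75 * D_VLM)   # the action expert's reduced width: 48
STATE_DIM = 6                  # e.g. joint positions of a 6-DoF arm

W_state_to_vlm = rng.standard_normal((STATE_DIM, D_VLM)) * 0.02
W_vlm_to_expert = rng.standard_normal((D_VLM, D_EXPERT)) * 0.02

def embed_state(q_t):
    """Project the raw proprioceptive state into the shared VLM token space,
    then down into the narrower action-expert token space."""
    vlm_token = q_t @ W_state_to_vlm    # matches d_VLM, shared with the backbone
    return vlm_token @ W_vlm_to_expert  # expert token space, d = 0.75 * d_VLM

state_token = embed_state(rng.standard_normal(STATE_DIM))
print(state_token.shape)  # (48,)
```

Keeping the expert narrower than the backbone shrinks the per-layer attention and MLP cost inside the expert, one of several choices that make SmolVLA's footprint small.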
Departing from reliance on proprietary datasets, SmolVLA pretrains exclusively on 450+ *community datasets*, totaling 20K+ trajectories. Because instructions in community-contributed datasets can be noisy or missing, the authors re-annotate tasks with a small off-the-shelf VLM using frames sampled from the dataset, and standardize camera viewpoints by mapping sources to a consistent top/wrist/side ordering. At inference, similarly to $\pi_0$, SmolVLA integrates the flow over 10 steps, resulting in fast inference. SmolVLA proves effective across a range of both real-world and simulated environments, rivaling $\pi_0$ while being close to 40% faster and consuming 6x less memory.

#### Code Example: Using SmolVLA

## Conclusions

This tutorial has chronicled the paradigmatic shift transforming robotics, from the structured, model-based methods of its classical era to the dynamic, data-driven approaches that define modern robot learning. We began by examining the limitations of dynamics-based control, highlighting the brittleness and significant engineering overhead of traditional approaches, which in turn motivate more flexible, less model-intensive learning approaches.