import ResponsiveImage from '../../components/ResponsiveImage.astro'
import Ch4BcTrajectories from '../assets/image/ch4/ch4-bc-trajectories.png'
import Ch4ObservationActionMapping from '../assets/image/ch4/ch4-observation-action-mapping.png'
import Ch4IssuesWithBc from '../assets/image/ch4/ch4-issues-with-bc.png'
import Ch4TaskEffectOnPairs from '../assets/image/ch4/ch4-task-effect-on-pairs.png'
import Ch4LatentVariableModel from '../assets/image/ch4/ch4-latent-variable-model.png'
import Ch4ManyLatents from '../assets/image/ch4/ch4-many-latents.png'
import Ch4DiffusionRobotActions from '../assets/image/ch4/ch4-diffusion-robot-actions.png'
import Ch4ActionVsObservationDistribution from '../assets/image/ch4/ch4-action-vs-observation-distribution.png'
import Ch4NormalizingFlows from '../assets/image/ch4/ch4-normalizing-flows.png'
import Ch4DiffusionVsFlowmatching from '../assets/image/ch4/ch4-diffusion-vs-flowmatching.png'
import Ch4Act from '../assets/image/ch4/ch4-act.png'
import Ch4ActEncoder from '../assets/image/ch4/ch4-act-encoder.png'
import Ch4ActDecoder from '../assets/image/ch4/ch4-act-decoder.png'
import Ch4DiffusionPolicy from '../assets/image/ch4/ch4-diffusion-policy.png'
import Ch4AsyncInference from '../assets/image/ch4/ch4-async-inference.png'
import Ch4Queues from '../assets/image/ch4/ch4-queues.png'


# Robot (Imitation) Learning

::: epigraph
*The best material model for a cat is another, or preferably the same cat*

Norbert Wiener
:::

> **TL;DR**
> Behavioral Cloning provides a natural platform to learn from real-world interactions without the need to design any reward function, and generative models prove more effective than point-wise policies at dealing with multimodal demonstration datasets.

<ResponsiveImage src={Ch4BcTrajectories} alt="(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in /svla_so101_pickplace. Proprioceptive states prove invaluable to determine the robot's state during an episode. (B) Camera frames are also recorded alongside measurements on the robot's state, capturing information about the robot's interaction with its environment." id="fig-fig:ch4-bc-trajectories" />

*(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in /svla_so101_pickplace. Proprioceptive states prove invaluable to determine the robot's state during an episode. (B) Camera frames are also recorded alongside measurements on the robot's state, capturing information about the robot's interaction with its environment.*

Learning from human demonstrations provides a pragmatic alternative to the reinforcement-learning pipeline discussed in the following section.
Indeed, in real-world robotics online exploration is typically costly and potentially unsafe, and designing (dense) reward signals is a brittle, task-specific process.
In general, success detection itself may often require bespoke instrumentation, while episodic training demands reliable resets---all factors complicating training RL algorithms on hardware at scale.
Behavioral Cloning (BC) sidesteps these constraints by casting control as an imitation learning problem, leveraging previously collected expert demonstrations.
Most notably, by learning to imitate, autonomous systems naturally adhere to the objectives, preferences, and success criteria implicitly encoded in the data, which reduces early-stage exploratory failures and obviates hand-crafted reward shaping altogether.

Formally, let $\mathcal D = \{ \tau^\{(i)\} \}_\{i=1\}^N$ be a set of expert trajectories, with $\tau^\{(i)\} = \{(o_t^\{(i)\}, a_t^\{(i)\})\}_\{t=0\}^\{T_i\}$ representing the $i$-th trajectory in $\mathcal D$, $o_t \in \mathcal\{O\}$ denoting observations (e.g., images and proprioception altogether), and $a_t \in \mathcal\{A\}$ the expert actions.
Typically, observations $o \in \mathcal\{O\}$ consist of both image and proprioceptive information, while actions $a \in \mathcal\{A\}$ represent control specifications for the robot to execute, e.g. a joint configuration.
Note that differently from the following section, in the imitation learning context $\mathcal D$ denotes an offline dataset collecting $N$ length-$T_i$ reward-free (expert) human trajectories $\tau^\{(i)\}$, and *not* the environment dynamics.
Similarly, in this section $\tau^\{(i)\}$ represents a length-$T_i$ trajectory of observation-action pairs, which crucially *omits entirely any reward* information.
The first figure in this section graphically shows trajectories in terms of the average evolution of the actuation on the 6 joints over a group of teleoperated episodes for the SO-100 manipulator.
Notice how proprioceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified, high-frame-rate collection of teleoperation data.
The figure below shows $(o_t, a_t)$-pairs for the same dataset, with the actions performed by the human expert illustrated alongside the corresponding observation.
In principle, (expert) trajectories $\tau^\{(i)\}$ can have different lengths since demonstrations might exhibit multi-modal strategies to attain the same goal, resulting in possibly multiple, different behaviors.

<ResponsiveImage src={Ch4ObservationActionMapping} alt="Sample observations and action pairs over the course of a given trajectory recorded in /svla_so101_pickplace. Observations, comprising both proprioceptive and visual information, are recorded alongside the configuration of a second, leader robot controlled by a human expert, providing complete information for regressing actions given observations." id="fig-fig:ch4-observation-action-mapping" />

*Sample observations and action pairs over the course of a given trajectory recorded in /svla_so101_pickplace. Observations, comprising both proprioceptive and visual information, are recorded alongside the configuration of a second, leader robot controlled by a human expert, providing complete information for regressing actions given observations.*

Behavioral Cloning (BC) [@pomerleauALVINNAutonomousLand1988a] aims at producing synthetic behaviors by learning the mapping from observations to actions, and in its most natural formulation can be effectively tackled as a *supervised* learning problem, consisting of learning the (deterministic) mapping $f: \mathcal\{O\} \mapsto \mathcal\{A\}, \ a_t = f(o_t)$ by solving
$$\begin\{equation\}

    \min_\{f\} \mathbb\{E\}_\{(o_t, a_t) \sim p(\bullet)\} \mathcal L(a_t, f(o_t)),
\end\{equation\}$$
for a given risk function $\mathcal L:  \mathcal A \times \mathcal A \mapsto \mathbb\{R\}, \ \mathcal L (a, a^\prime)$.

Typically, the expert's joint observation-action distribution $p: \mathcal\{O\} \times \mathcal\{A\} \mapsto [0,1]$ such that $(o,a) \sim p(\bullet)$ is assumed to be unknown, in keeping with a classic Supervised Learning (SL) framework[^1].
However, differently from standard SL's assumptions, the samples collected in $\mathcal D$, corresponding to observations of the underlying $p$, are *not* i.i.d., as expert demonstrations are collected *sequentially* in trajectories.
In practice, this aspect can be partially mitigated by considering pairs in a non-sequential order---*shuffling* the samples in $\mathcal D$---so that the expected risk under $p$ can be approximated using MC estimates, although estimates may in general be less accurate.
Another strategy to mitigate the impact of regressing over non-i.i.d. samples relies on the possibility of interleaving BC and data collection [@rossReductionImitationLearning2011], aggregating multiple datasets iteratively.
However, because we only consider the case where a single offline dataset $\mathcal D$ of (expert) trajectories is already available, dataset aggregation falls out of scope.
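To make the supervised formulation concrete, the following is a minimal sketch of BC as regression over shuffled $(o, a)$ pairs. It assumes a hypothetical linear policy, a synthetic linear "expert", and an L2 risk; the dimensions, learning rate, and names such as `W_expert` and `bc_loss` are illustrative choices, not part of any specific library or dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 4-dim observations, 2-dim actions from a linear "expert".
N, obs_dim, act_dim = 2048, 4, 2
W_expert = rng.normal(size=(obs_dim, act_dim))
observations = rng.normal(size=(N, obs_dim))
actions = observations @ W_expert + 0.01 * rng.normal(size=(N, act_dim))

def bc_loss(W, o, a):
    """Monte Carlo estimate of the expected L2 risk E[||a - f(o)||^2]."""
    return float(np.mean(np.sum((a - o @ W) ** 2, axis=1)))

W = np.zeros((obs_dim, act_dim))   # parameters of the linear policy f(o) = o @ W
lr, batch_size, epochs = 0.05, 64, 20
initial_loss = bc_loss(W, observations, actions)

for _ in range(epochs):
    order = rng.permutation(N)     # shuffle pairs to break the temporal ordering
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        o, a = observations[idx], actions[idx]
        grad = 2.0 * o.T @ (o @ W - a) / len(idx)   # gradient of the minibatch risk
        W -= lr * grad

final_loss = bc_loss(W, observations, actions)
```

In this toy setting the pairs are i.i.d. by construction, so shuffling only mimics what one would do with real trajectories; with sequential data, the permutation is what decorrelates consecutive samples within a minibatch.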

Despite the inherent challenges of learning on non-i.i.d. data, the BC formulation affords several operational advantages in robotics.
First, training happens offline and typically uses expert human demonstration data, thereby severely limiting exploration risks by preventing the robot from performing dangerous actions altogether.
Second, reward design is entirely unnecessary in BC, as demonstrations already reflect human intent and task completion.
This also mitigates the risk of misalignment and specification gaming (*reward hacking*), otherwise inherent in purely reward-based RL [@heessEmergenceLocomotionBehaviours2017].
Third, because expert trajectories encode terminal conditions, success detection and resets are implicit in the dataset.
Finally, BC scales naturally with growing corpora of demonstrations collected across tasks, embodiments, and environments.
However, BC can in principle only learn behaviors that are, at most, as good as the one exhibited by the demonstrator, and thus critically provides no mitigation for the suboptimal decision making that might be enacted by humans.
Still, while problematic in sequential-decision making problems for which expert demonstrations are not generally available---data might be expensive to collect, or human performance may be inherently suboptimal---many robotics applications benefit from relatively cheap pipelines to acquire high-quality trajectories generated by humans, thus justifying BC approaches.

<ResponsiveImage src={Ch4IssuesWithBc} alt="Point-wise policies suffer from limitations due to (A) covariate shifts and poor approximation of (B) multimodal demonstrations. (A) Initially small errors may drive the policy out of distribution, triggering a vicious circle ultimately resulting in failure. (B) Both modes of reaching for a target object in a scene, either left or right-first, are equally good and thus equally likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations." id="fig-fig:ch4-issues-with-bc" />

*Point-wise policies suffer from limitations due to (A) covariate shifts and poor approximation of (B) multimodal demonstrations. (A) Initially small errors may drive the policy out of distribution, triggering a vicious circle ultimately resulting in failure. (B) Both modes of reaching for a target object in a scene, either left or right-first, are equally good and thus equally likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations.*

While conceptually elegant, point-estimate policies $f : \mathcal\{O\} \mapsto \mathcal\{A\}$ learned by solving the supervised objective above have been observed to suffer from (1) compounding errors [@rossReductionImitationLearning2011] and (2) poor fit to multimodal distributions [@florenceImplicitBehavioralCloning2022, keGraspingChopsticksCombating2020].
The figure above illustrates these two key issues related to learning *explicit policies* [@florenceImplicitBehavioralCloning2022].
Besides sequentiality in $\mathcal D$, compounding errors due to *covariate shift* may also prove catastrophic, as even small $\epsilon$-prediction errors $0 < \Vert f(o_t) - a_t \Vert \leq \epsilon$ can quickly drive the policy into out-of-distribution states, resulting in less confident predictions and thus compounding errors (figure above, left).
Moreover, point-estimate policies typically fail to learn *multimodal* targets, which are very common in human demonstrations solving robotics problems, since multiple trajectories can be equally good towards the accomplishment of a goal (e.g., symmetric grasps; figure above, right).
In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands [@florenceImplicitBehavioralCloning2022].
To address poor multimodal fitting, @florenceImplicitBehavioralCloning2022 propose learning the generative model $p(o, a)$ underlying the samples in $\mathcal D$, rather than explicitly learning a prediction function $f(o) = a$.
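The mode-averaging failure can be reproduced in a few lines. The sketch below assumes a hypothetical one-dimensional action and a single fixed observation for which the expert goes either "left" ($a \approx -1$) or "right" ($a \approx +1$) with equal probability; under an L2 risk, the best point estimate for a fixed observation is the mean of the demonstrated actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: for the same observation, the expert goes
# left (a = -1) or right (a = +1) with equal probability, plus small noise.
n = 10_000
actions = rng.choice([-1.0, 1.0], size=n) + 0.05 * rng.normal(size=n)

# For a fixed observation, the L2-optimal point estimate is the sample mean.
point_estimate = actions.mean()

# Distance from the point estimate to the closest demonstrated mode.
gap_to_nearest_mode = min(abs(point_estimate + 1.0), abs(point_estimate - 1.0))
```

The point estimate lands near zero, an action that was essentially never demonstrated, whereas a model of the full action distribution could still sample either mode.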

## A (Concise) Introduction to Generative Models

Generative Models (GMs) aim to learn the stochastic process underlying the very generation of the data collected, and typically do so by fitting a probability distribution that approximates the unknown *data distribution*, $p$.
In the case of BC, this unknown data distribution $p$ represents the expert's joint distribution over $(o, a)$-pairs.
Thus, given a finite set of $N$ pairs $\mathcal D = \{ (o,a)_i \}_\{i=0\}^N$ used as an imitation learning target (and thus assumed to be i.i.d.), GM seeks to learn a *parametric* distribution $p_\theta(o,a)$ such that (1) new samples $(o,a) \sim p_\theta(\bullet)$ resemble those stored in $\mathcal D$, and (2) high likelihood is assigned to the observed regions of the unobservable $p$.
Likelihood-based learning provides a principled training objective to achieve both objectives, and it is thus extensively used in GM [@prince2023understanding].

### Variational Auto-Encoders

<ResponsiveImage src={Ch4TaskEffectOnPairs} alt="Intuitively, latent variable in a single latent model may contain information regarding the task being performed, which directly results in the likelihood of the same observation-action pair being different for two different tasks. When (A) picking a block the likelihood of a wide gripper's opening should be higher than narrower one, while it should be the opposite when (B) pushing the block." id="fig-fig:ch4-task-effect-on-pairs" />

*Intuitively, latent variable in a single latent model may contain information regarding the task being performed, which directly results in the likelihood of the same observation-action pair being different for two different tasks. When (A) picking a block the likelihood of a wide gripper's opening should be higher than narrower one, while it should be the opposite when (B) pushing the block.*

A common inductive bias used in GM posits that samples $(o,a)$ are influenced by an unobservable latent variable $z \in Z$, resulting in
$$\begin\{equation\}

    p (o,a) = \int_\{\text\{supp\}\{Z\}\} p(o,a \vert z) p(z)
\end\{equation\}$$
Intuitively, in the case of observation-action pairs $(o, a)$ for a robotics application, $z$ could be some high-level representation of the underlying task being performed by the human demonstrator.
In such a case, treating $p(o,a)$ as a marginalization over $\text\{supp\}\{Z\}$ of the complete joint distribution $p(o,a,z)$ natively captures the effect different tasks have on the likelihood of observation-action pairs.
The figure above graphically illustrates this concept in the case of a (A) picking and (B) pushing task, for which, nearing the target object, the likelihood of actions resulting in opening the gripper---the higher $q_6$, the wider the gripper's opening---should intuitively be (A) high or (B) low, depending on the task performed.
While the latent space $Z$ typically has a much richer structure than the set of all actual tasks performed, the latent-variable factorization above still provides a solid framework to learn joint distributions conditioned on unobservable yet relevant factors.
The figure below represents this latent-variable framework for a robotics application: the true, $z$-conditioned generative process assigns *likelihood* $p((o,a) \vert z)$ to the single $(o,a)$-pair.
Using Bayes' theorem, one can reconstruct the *posterior* distribution on $\text\{supp\}\{Z\}$, $q_\theta(z \vert o,a)$, from the likelihood $p_\theta(o,a \vert z)$, *prior* $p_\theta(z)$ and *evidence* $p_\theta(o,a)$.
VAEs approximate the latent variable model presented above using an *approximate posterior* $q_\phi(z \vert o,a)$ while regressing parameters for a parametric likelihood, $p_\theta(o,a \vert z)$ (figure below, panel B).
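As a small worked illustration of the marginalization and of Bayes' theorem, the sketch below replaces the continuous latent with a hypothetical discrete one, $z \in \{\text{pick}, \text{push}\}$, and models only the gripper opening with a Gaussian likelihood per task. All numbers (prior, means, standard deviations) are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a scalar Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical two-task latent: z = 0 (pick) favors a wide gripper opening,
# z = 1 (push) favors a narrow one. All numbers are illustrative.
prior = np.array([0.5, 0.5])     # p(z)
means = np.array([0.8, 0.1])     # mean gripper opening given z
stds = np.array([0.1, 0.1])

def likelihood(a):
    """p(a | z) for each value of z."""
    return gaussian_pdf(a, means, stds)

def marginal(a):
    """p(a) = sum_z p(a | z) p(z): the discrete analogue of the integral over Z."""
    return float(np.sum(likelihood(a) * prior))

def posterior(a):
    """p(z | a) from Bayes' theorem: likelihood * prior / evidence."""
    return likelihood(a) * prior / marginal(a)

wide_opening = 0.75
post = posterior(wide_opening)   # probability of each task given a wide opening
```

With a discrete latent the evidence is a finite sum and the posterior is exact; the difficulty addressed by VAEs arises when $z$ is continuous and the likelihood is a neural network, so that the integral has no closed form.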

<ResponsiveImage src={Ch4LatentVariableModel} alt="(A) The latent variable model in a robotics application regulates influence between observed $(o,a)$ variables and an unobservable latent variable. (B) VAEs approximate exact latent variable models by means of variational inference." id="fig-fig:ch4-latent-variable-model" />

*(A) The latent variable model in a robotics application regulates influence between observed $(o,a)$ variables and an unobservable latent variable. (B) VAEs approximate exact latent variable models by means of variational inference.*

Given a dataset $\mathcal D$ consisting of $N$ i.i.d. observation-action pairs, the log-likelihood of all datapoints under $\theta$ (in Bayesian terms, the *evidence* $p_\theta(\mathcal D)$) can thus be written as:

$$\begin\{aligned\}
\log p_\theta(\mathcal D) &= \sum_\{i=0\}^N \log p_\theta ((o,a)_i) \\
 &= \sum_\{i=0\}^N \log \int_\{\text\{supp\}\{Z\}\} p_\theta((o,a)_i \vert z) p(z) \\
 &= \sum_\{i=0\}^N \log \int_\{\text\{supp\}\{Z\}\} \frac\{q_\theta(z \vert (o,a)_i)\}\{q_\theta(z \vert (o,a)_i)\} \cdot p_\theta((o,a)_i \vert z) p(z) \\
 &= \sum_\{i=0\}^N \log \mathbb E_\{z \sim q_\theta(\bullet \vert (o,a)_i)\} \left[ \frac\{p(z)\}\{q_\theta(z \vert (o,a)_i)\} \cdot p_\theta((o,a)_i \vert z) \right],
\end\{aligned\}$$

where we used the i.i.d. assumption in the first line, the latent variable model in the second, multiplied by $1 = \frac\{q_\theta(z \vert (o,a)_i)\}\{q_\theta(z \vert (o,a)_i)\}$ in the third, and used the definition of expected value in the fourth.

In the special case where one assumes distributions to be tractable, $p_\theta (\mathcal D)$ is typically tractable too, and $\max_\theta \log p_\theta(\mathcal D)$ provides a natural target for (point-wise) inferring the unknown parameters $\theta$ of the generative model.
Unfortunately, the log-evidence above is rarely tractable when the distribution $p$ is modeled with approximators such as neural networks, especially for high-dimensional, unstructured data.

In their seminal work on Variational Auto-Encoders (VAEs), @kingmaAutoEncodingVariationalBayes2022 present two major contributions to learn complex latent-variable GMs on unstructured data, proposing (1) a tractable, variational lower-bound to the log-evidence $\log p_\theta(\mathcal D)$ as an optimization target to jointly learn likelihood and posterior and (2) high-capacity function approximators to model the likelihood $p_\theta(o,a\vert z)$ and (approximate) posterior distribution $q_\phi(z \vert o,a) \approx q_\theta(z \vert o,a)$.

In particular, the lower bound on the log-evidence (Evidence LOwer Bound, *ELBO*) can be derived from the expression above by applying Jensen's inequality---$\log \mathbb\{E\}[\bullet] \geq \mathbb\{E\} [\log (\bullet)]$---yielding:

$$\begin\{aligned\}
\log p_\theta(\mathcal D) &\geq \sum_\{i=0\}^\{N\} \left(
            \mathbb\{E\}_\{z \sim q_\theta(\cdot \vert (o,a)_i)\} \big[ \log p_\theta((o,a)_i \vert z) \big]
            + \mathbb\{E\}_\{z \sim q_\theta(\cdot \vert (o,a)_i)\} \left[ \log \left( \frac\{p(z)\}\{q_\theta(z \vert (o,a)_i)\} \right) \right]
        \right) \\
 &= \sum_\{i=0\}^\{N\} \left(
            \mathbb\{E\}_\{z \sim q_\theta(\cdot \vert (o,a)_i)\} \big[ \log p_\theta((o,a)_i \vert z) \big]
        - \text\{D\}_\{\text\{KL\}\} \big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big]
        \right)
\end\{aligned\}$$

The true, generally intractable posterior $q_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in the bound above, and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective,

$$\text\{ELBO\}_\{\mathcal D\}(\theta, \phi) = \sum_\{i=0\}^\{N\} \left(
            \mathbb\{E\}_\{z \sim q_\phi(\cdot \vert (o,a)_i)\} \big[ \log p_\theta((o,a)_i \vert z) \big]
        - \text\{D\}_\{\text\{KL\}\} \big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big]
        \right)
        $$

By Jensen's inequality, the ELBO lower-bounds the log-likelihood of the data, so maximizing it provides a natural, tractable optimization target.
Indeed, expectations can be estimated using MC estimates from the learned distributions in the ELBO, while the KL-divergence term can typically be computed in closed-form by (1) modeling $q_\phi$ as a Gaussian $q_\phi(z \vert o,a) = \mathcal N\big(\mu_\phi(o,a), \Sigma_\phi(o,a) \big)$ and (2) imposing a standard Gaussian prior on the latent space, $p(z) = \mathcal N(\mathbf\{0\}, \mathbf\{I\})$.
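The bound can be checked on a toy model where everything is tractable. The sketch below assumes a hypothetical scalar linear-Gaussian model, $z \sim \mathcal N(0, 1)$ and $x \vert z \sim \mathcal N(z, \sigma^2)$ with $x$ standing in for an $(o,a)$ pair, for which the evidence is $\mathcal N(0, 1 + \sigma^2)$ and the exact posterior is Gaussian. The function names are illustrative.

```python
import numpy as np

def log_evidence(x, sigma2):
    """Exact log p(x) for z ~ N(0, 1), x | z ~ N(z, sigma2): p(x) = N(0, 1 + sigma2)."""
    var = 1.0 + sigma2
    return -0.5 * np.log(2.0 * np.pi * var) - x**2 / (2.0 * var)

def kl_to_standard_normal(m, s2):
    """Closed-form KL[N(m, s2) || N(0, 1)] for scalar Gaussians."""
    return 0.5 * (s2 + m**2 - 1.0 - np.log(s2))

def elbo(x, sigma2, m, s2):
    """ELBO with approximate posterior q(z | x) = N(m, s2)."""
    # E_q[log p(x | z)], using E_q[(x - z)^2] = (x - m)^2 + s2.
    expected_log_lik = (
        -0.5 * np.log(2.0 * np.pi * sigma2) - ((x - m) ** 2 + s2) / (2.0 * sigma2)
    )
    return expected_log_lik - kl_to_standard_normal(m, s2)

x, sigma2 = 1.0, 1.0
# The exact posterior of this toy model is N(x / (1 + sigma2), sigma2 / (1 + sigma2)).
m_star, s2_star = x / (1.0 + sigma2), sigma2 / (1.0 + sigma2)

tight = elbo(x, sigma2, m_star, s2_star)   # the bound is tight at the exact posterior
loose = elbo(x, sigma2, m=0.0, s2=1.0)     # using the prior as q gives a lower value
```

The gap between the log-evidence and the ELBO equals the KL divergence between the approximate and the exact posterior, which is why the bound is tight only when the two coincide.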

An intuitive explanation of the learning dynamics of VAEs can be given considering the equivalent case of *minimizing the negative ELBO*, which admits a particularly interpretable factorization

$$\begin\{aligned\}
\min_\{\theta, \phi\} - \text\{ELBO\}_\{(o,a) \sim \mathcal D\}(\theta, \phi) &= \min_\{\theta, \phi\} \mathbf\{L^\{\text\{rec\}\}\}(\theta) + \mathbf\{L^\{\text\{reg\}\}\}(\phi), \\
\mathbf\{L^\{\text\{rec\}\}\}(\theta) &= - \mathbb\{E\}_\{z \sim q_\phi(\cdot \vert o,a)\} \big[ \log p_\theta(o,a \vert z) \big], \\
\mathbf\{L^\{\text\{reg\}\}\}(\phi) &= \text\{D\}_\{\text\{KL\}\} \big[ q_\phi(z \vert o,a) \Vert p(z) \big].
\end\{aligned\}$$

For any given $(o,a)$ pair, the expected value term of the reconstruction loss is typically computed via MC estimates, resulting in
$$-\mathbb\{E\}_\{z \sim q_\phi(\bullet \vert o,a)\} \big[ \log p_\theta(o,a \vert z) \big] = \mathbf\{L^\{\text\{rec\}\}\} \approx - \frac\{1\}\{n\} \sum_\{i=0\}^n \log p_\theta(o,a \vert z_i).$$
Assuming $p_\theta(o,a \vert z)$ is parametrized as an isotropic Gaussian distribution with mean $\mu_\theta (z) \in \mathbb R^d$ and variance $\sigma^2$, the log-likelihood thus simplifies to:
$$\log p(o,a \vert z_i) = -\frac\{1\}\{2\sigma^\{2\}\} \big \Vert (o,a)-\mu_\theta(z_i) \big\Vert_2^2 -\frac\{d\}\{2\}\log(2\pi \sigma^\{2\}) \implies \mathbf\{L^\text\{rec\}\} \approx \frac \{1\}\{n\} \sum_\{i=0\}^n \big\Vert (o,a) - \mu_\theta(z_i) \big \Vert^2_2$$
Indeed, it is very common in practice to approximate the learned likelihood $p_\theta(o,a \vert z)$ as a parametric distribution (e.g., a Gaussian) parametrized by some learned vector of coefficients derived from $\mu_\theta (z), \ z \sim p (\bullet)$.
In all such cases, learning a VAE corresponds to optimally *reconstructing* the examples in $\mathcal D$ by minimizing the L2-error---a very common *supervised learning* objective for regression targets---while regularizing the information compression into the latent, as under the common modeling choice $p(z) = \mathcal N (\mathbf\{0\}, \mathbf\{I\})$ the KL term $\mathbf\{L^\{\text\{reg\}\}\}$ regularizes the posterior, limiting the expressivity of $q_\phi(z\vert o,a)$.
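The link between the Gaussian reconstruction term and the L2 error can be verified numerically. The sketch below is illustrative only: it assumes a hypothetical linear decoder `W`, `b`, made-up encoder outputs `mu_phi` and `std_phi` for a single flattened $(o,a)$ pair, a fixed likelihood variance, and samples $z_i$ drawn as mean plus scaled standard Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

d, latent_dim, n = 6, 2, 16      # data dim, latent dim, number of MC samples
x = rng.normal(size=d)           # stands in for a flattened (o, a) pair

# Hypothetical encoder outputs: mean and diagonal std of q_phi(z | o, a).
mu_phi = rng.normal(size=latent_dim)
std_phi = np.full(latent_dim, 0.3)

# Hypothetical linear decoder mu_theta(z) = W z + b.
W, b = rng.normal(size=(d, latent_dim)), rng.normal(size=d)
sigma2 = 0.5                     # fixed isotropic likelihood variance

# Draw z_i ~ q_phi(. | o, a) as z = mu + std * eps, with eps ~ N(0, I).
eps = rng.normal(size=(n, latent_dim))
z = mu_phi + std_phi * eps
recon = z @ W.T + b              # mu_theta(z_i), shape (n, d)

sq_err = np.sum((x - recon) ** 2, axis=1)   # ||x - mu_theta(z_i)||^2 per sample
log_lik = -sq_err / (2.0 * sigma2) - 0.5 * d * np.log(2.0 * np.pi * sigma2)

rec_loss = -np.mean(log_lik)     # MC estimate of -E_q[log p_theta(x | z)]
l2_loss = np.mean(sq_err)        # the L2 surrogate
const = 0.5 * d * np.log(2.0 * np.pi * sigma2)

# Closed-form KL[N(mu, diag(std^2)) || N(0, I)], i.e. the regularization term.
reg_loss = 0.5 * np.sum(std_phi**2 + mu_phi**2 - 1.0 - 2.0 * np.log(std_phi))
neg_elbo = rec_loss + reg_loss
```

Here `rec_loss` equals `l2_loss / (2 * sigma2) + const`, so for a fixed variance the two objectives share the same minimizers in the decoder parameters.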

### Diffusion Models

VAEs approximate probability distributions via a *single* latent variable model, assuming the underlying unknown distribution can be factored according to the single-latent marginalization introduced above, and solve the variational inference problem of jointly learning the likelihood $p_\theta$ and (approximate) posterior $q_\phi$ for such a model.
In that, the unknown data distribution $p(o,a)$ is effectively approximated via $\int_Z p(z) p_\theta(o,a \vert z)$, and the underlying generative process is reproduced by (1) sampling a latent variable and (2) learning to decode it into an (ideally) high-likelihood sample under the (unknown) $p(o,a)$.
Diffusion Models (DMs) [@hoDenoisingDiffusionProbabilistic2020] are another class of GMs which tackle the similar problem of approximating an underlying unknown data distribution---*variational inference*---by *partially* extending VAEs to the case where *multiple* latent variables influence each other and the generative process underlying $o,a$ itself.
In particular, DMs posit the generative process can be decomposed into a series of piece-wise (Markovian) interactions between (latent) variables (see the figure below), resulting in

$$\begin\{aligned\}
p(\underbrace\{o,a\}_\{= z_0\}) &= \int_\{\text\{supp\}\{Z_1\}\} \cdots \int_\{\text\{supp\}\{Z_T\}\} p(z_0, z_1, \dots z_T), \\
p(z_0, z_1, \dots z_T) &= p(z_T) \prod_\{t=1\}^\{T\} p(z_\{t-1\} \vert z_t),
\end\{aligned\}$$

where we explicitly showed the marginalization over the multiple latents in the first line, and used the law of conditional probability and the Markov property in the second.

<ResponsiveImage src={Ch4ManyLatents} alt="HMLV models posit the data generation process is influenced by a stack of Markov-dependent latent variables, with samples from the posterior distribution being progressively higher up in the hierarchy." id="fig-fig:ch4-many-latents" />

*HMLV models posit the data generation process is influenced by a stack of Markov-dependent latent variables, with samples from the posterior distribution being progressively higher up in the hierarchy.*

Similarly to VAEs, providing an exact interpretation for the latent variables is typically not possible.
Still, one fairly reasonable application-driven intuition is that, by providing a model of the hierarchical, decoupled interaction of latent variables, Hierarchical Markov Latent Variable (HMLV) models attempt to capture the different resolutions at which different conditioning factors intervene, so that in a robotics application for instance, one could naturally distinguish between early-stage trajectory planning ($t \to T$) and fine-grained adjustments ($t \to 0$).
In that, HMLV models thus provide a framework to perform variational inference via multiple, sequential sampling steps from different higher level distributions instead of approximating the generative process with a single-latent variable model.
DMs are a particular instantiation of HMLV models for which the posterior is $q( z_t \vert z_\{t-1\}) = \mathcal N(z_\{t-1\} \sqrt\{1-\beta_t\}, \beta_t \mathbf\{I\})$ for a given $\beta_t \in \mathbb R^+$, thereby iteratively reducing the signal-to-noise ratio as $\beta_t$ increases along the latents hierarchy.

Just like VAEs, DMs attempt to learn to reproduce an underlying data distribution $p (o,a)$ given a collection of i.i.d. samples, approximating the model posited to have generated the data in the first place (the hierarchical latent-variable model above).
Similarly to VAEs, DMs approximate the process of sampling from the unknown $p(o,a)$ by (1) sampling from an easy-to-sample distribution (e.g., Gaussian) and (2) learning to reconstruct high-likelihood samples under the unknown distribution.
However, in stark contrast with VAEs, the easy-to-sample distribution contains *no mutual information* regarding the data distribution $p(o,a)$.
Crucially, as no information from the sample $(o,a)$ (denoted as $z_0 \equiv (o,a)$ for the sake of notation) is assumed to be propagated throughout the chain of latents, the posterior $q(z_t \vert z_\{t-1\})$ assumes a relatively simple structure in DMs, reducing complexity.
The *true* likelihood $p(z_\{t-1\} \vert z_t)$ is instead typically approximated using the parametrization $p_\theta (z_\{t-1\} \vert z_t)$.
In that, the information contained in the unknown data distribution is *reconstructed* via a process in which samples from a fixed distribution are turned into (ideally) high-likelihood samples under $p(o,a)$---a process referred to as *denoising*.

Under such a model, we can lower-bound the log-likelihood of an arbitrary sample as[^2]

$$\begin\{aligned\}
\log p_\theta (\underbrace\{o,a\}_\{= z_0\}) \geq \ &\mathbb\{E\}_\{z_1 \sim q(\bullet \vert z_0)\} \log p_\theta (z_0 \vert z_1) \\
    &- \mathbb\{E\}_\{z_\{T-1\} \sim q(\bullet \vert z_0)\} \big[ \text\{D\}_\{\text\{KL\}\} (q(z_T \vert z_\{T-1\}) \Vert p(z_T) ) \big] \\
    &- \sum_\{t=1\}^\{T-1\} \mathbb\{E\}_\{(z_\{t-1\}, z_\{t+1\}) \sim q(\bullet \vert z_0)\} \big[ \text\{D\}_\{\text\{KL\}\} (q(z_t \vert z_\{t-1\}) \Vert p_\theta(z_t \vert z_\{t+1\}) ) \big],
\end\{aligned\}$$

providing an optimization target in the form of $\max_\theta \log p_\theta (\mathcal D)$.

In their seminal work on using DMs for variational inference, @hoDenoisingDiffusionProbabilistic2020 introduce major contributions regarding solving $\min_\theta -\log p_\theta(o,a)$.
In particular, @hoDenoisingDiffusionProbabilistic2020 exclusively adopt a fixed *Gaussian* posterior in the form of $q(z_t \vert z_\{t-1\}) = \mathcal\{N\}(\sqrt\{1-\beta_t\}z_\{t-1\}, \beta_t \mathbf I)$.
The choice of adopting Gaussians has profound implications on the generative process modeled.
Indeed, under the (mild) assumption that the variance is sufficiently small, $\beta_t \leq \eta, \ \eta \in \mathbb R^+$, @sohl-dicksteinDeepUnsupervisedLearning2015 proved that the likelihood $p(z_\{t-1\} \vert z_t)$ is Gaussian as well, which allows for the particularly convenient parametrization of the approximate likelihood $p_\theta (z_\{t-1\} \vert z_t) = \mathcal N(\mu_\theta(z_t, t), \Sigma_\theta(z_t,t)), \ t \in [1,T]$, as well as for closed-form tractability of the KL-divergence terms in the log-likelihood expression above.
Further, the posterior's structure also enables an analytical description for the distribution of the $t$-th latent variable, $q(z_t \vert z_0) = \mathcal N (\sqrt\{\bar\{\alpha\}_t\}z_0, (1-\bar\{\alpha\}_t) \mathbf\{I\})$, with $\alpha_t = 1-\beta_t, \ \bar \alpha_t = \prod_\{k=1\}^t \alpha_k$, which conveniently avoids iterative posterior sampling.
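The equivalence between iterating the Gaussian posterior and sampling $q(z_t \vert z_0)$ in one shot can be checked empirically. The sketch below assumes a hypothetical linear $\beta_t$ schedule and a scalar $z_0$ replicated many times, so that sample means and variances can be compared against $\sqrt\{\bar\alpha_t\} z_0$ and $1 - \bar\alpha_t$; the schedule values and function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(1e-4, 0.05, T)   # hypothetical linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{k <= t} alpha_k

def q_sample_iterative(z0, t):
    """Apply q(z_k | z_{k-1}) = N(sqrt(1 - beta_k) z_{k-1}, beta_k I) for k = 1..t."""
    z = z0.copy()
    for k in range(t):
        z = np.sqrt(1.0 - betas[k]) * z + np.sqrt(betas[k]) * rng.normal(size=z.shape)
    return z

def q_sample_closed_form(z0, t):
    """Sample q(z_t | z_0) = N(sqrt(alpha_bar_t) z_0, (1 - alpha_bar_t) I) directly."""
    abar = alpha_bars[t - 1]
    return np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * rng.normal(size=z0.shape)

z0 = np.full(50_000, 2.0)            # many copies of the same scalar sample
t = 60
z_iter = q_sample_iterative(z0, t)
z_closed = q_sample_closed_form(z0, t)
```

Both samplers produce matching first and second moments up to Monte Carlo error, while the closed form requires a single noise draw per sample regardless of $t$.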

<ResponsiveImage src={Ch4DiffusionRobotActions} alt="DMs iteratively corrupt samples (left) from an unknown distribution into a quasi-standard Gaussian (center), learning the displacement field (right) that permits to reconstruct samples from the unknown target distribution by iteratively denoising samples of a tractable, easy-to-sample distribution." id="fig-fig:diffusion-robot-actions" />

*DMs iteratively corrupt samples (left) from an unknown distribution into a quasi-standard Gaussian (center), learning the displacement field (right) that permits to reconstruct samples from the unknown target distribution by iteratively denoising samples of a tractable, easy-to-sample distribution.*

Finally, adopting Gaussian posteriors permits a particularly pleasing interpretation of the dynamics of training DMs [@permenterInterpretingImprovingDiffusion2024].
By using Gaussian posteriors, the hierarchical latent variables effectively lose increasingly more information about the original (unknown) distribution's sample, $z_0$, increasingly distributing according to a standard Gaussian and thus containing no information at all (see the figure above).
The same figure illustrates this procedure on a simplified, bidimensional observation-action distribution, where we considered $o=q_2$ and $a=q^h_2$, with $q_2$ representing the robot's *elbow flex* actuation and $q^h_2$ the elbow flex of the human teleoperator's robot.

<ResponsiveImage src={Ch4ActionVsObservationDistribution} alt="A joint action-observation distribution, in the simplified case where the observation is the elbow-flex actuation in a SO-100, and the action is the recorded position for the same joint in the teleoperator arm. The motion recorded being teleoperated, the points distribute along the diagonal." id="fig-fig:ch4-action-vs-observation-distribution" />

*A joint action-observation distribution, in the simplified case where the observation is the elbow-flex actuation in a SO-100, and the action is the recorded position for the same joint in the teleoperator arm. The motion recorded being teleoperated, the points distribute along the diagonal.*

Because the recorded behavior is teleoperated, measurements mostly distribute along the line $a = o + \eta, \eta \sim N(0,1)$, with $\eta$-variability accounting for minor control inconsistencies (see the figure above).
Using Gaussian posteriors---i.e., adding Gaussian noise---effectively simulates a *Brownian motion* for the elements in the distribution's support (in the figure above, $\mathcal\{O\} \times \mathcal\{A\}$), whereby information *diffuses away* from the samples, and comparing the diffused samples to the original data points one can derive an estimate of the total displacement induced by diffusion.
Under the sole assumption that the likelihood of the diffused samples is low under the original unknown data distribution, one can effectively approximate the unknown distribution by learning to *reverse* such displacement.
This key intuition allows writing a simplified training objective:

$$
    \mathcal L(\theta) = \mathbb\{E\}_\{t, z_0, \epsilon\} \big[
        \Vert \epsilon - \epsilon_\theta(\sqrt\{\bar \alpha_t\} z_0 + \epsilon \sqrt\{1 - \bar \alpha_t\}, t) \Vert^2 \big], \quad t \sim \mathcal\{U\}(\{1,\dots,T\}), \quad
        z_0 \sim \mathcal\{D\}, \quad
        \epsilon \sim \mathcal\{N\}(\mathbf\{0\},\mathbf\{I\}).$$

In this simplified (minimization) objective, the optimization process differs from the earlier likelihood-based objective in that, rather than maximizing $p_\theta$ directly, the parameters $\theta$ of the pairwise likelihood $p_\theta(z_\{t-1\} \vert z_t)$ are adjusted to *predict the total displacement* $\epsilon$ for a diffusion process of random length ($t \sim \mathcal\{U\}(\\{1,\dots,T\\})$) starting from a sample of the target distribution.

By learning the total displacement between a generally uninformative corrupted sample, obtained by diffusing information, and a sample from an unknown distribution, @hoDenoisingDiffusionProbabilistic2020 show that one can approximate the underlying distribution by reversing the displacement, i.e., by *denoising* samples.
This displacement is significant ($\Vert \epsilon \Vert > 0$) whenever the input and target distributions are sufficiently different.
Interestingly, under the hypothesis that real-world data lies on a single low-dimensional manifold embedded in a higher-dimensional space (Manifold Hypothesis), @permenterInterpretingImprovingDiffusion2024 show that diffusion learns the gradient of a distance function between any off-manifold point (such as perturbed, uninformative samples) and the data manifold itself.
Following this gradient---i.e., denoising a sample from an uninformative distribution---corresponds to projecting back onto the manifold, yielding a procedure to sample from unknown distributions by means of Euclidean projection.
Indeed, under the assumption that $p_\theta (z_\{t-1\} \vert z_t)$ is Gaussian, sampling $z_\{t-1\} \sim p_\theta(\bullet \vert z_\{t\})$ corresponds to computing

$$z_\{t-1\} = \frac\{1\}\{\sqrt\{\alpha_t\}\} \left( z_t - \frac\{\beta_t\}\{\sqrt\{1 - \bar\alpha_t\}\} \epsilon_\theta(z_t, t) \right) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal N(\mathbf\{0\}, \mathbf\{I\}), $$

thus showing that the lower-level latent variables in a DM can be obtained by iteratively removing noise from the one-step higher-order variable, using the noise regressor $\epsilon_\theta(z_t, t)$ learned by minimizing the simplified objective above.
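To make the two equations above concrete, the following is a minimal, scalar sketch of the closed-form forward corruption, a one-sample estimate of the simplified objective, and the iterative denoising update. It is an illustration under stated assumptions rather than a reference implementation: the linear noise schedule, the choice $\sigma_t = \sqrt{\beta_t}$, the noise-free final step, and all function names are our own choices for this example. To avoid training a network, the data distribution is assumed to be a point mass at `Z0`, for which the optimal noise regressor is known in closed form.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice): beta_t, alpha_t and cumulative alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars

def diffuse(z0, t, alpha_bars, rng):
    """Closed-form forward process: z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps."""
    eps = rng.gauss(0.0, 1.0)
    abar = alpha_bars[t - 1]  # diffusion steps are 1-indexed, lists are 0-indexed
    return math.sqrt(abar) * z0 + math.sqrt(1.0 - abar) * eps, eps

def simplified_loss(eps_model, z0, alpha_bars, rng):
    """One-sample estimate of the simplified objective for a data point z0."""
    t = rng.randint(1, len(alpha_bars))  # t ~ U({1, ..., T}), both ends included
    z_t, eps = diffuse(z0, t, alpha_bars, rng)
    return (eps - eps_model(z_t, t)) ** 2

def denoise(eps_model, schedule, rng):
    """Iteratively remove the predicted noise, from z_T ~ N(0, 1) down to z_0."""
    betas, alphas, alpha_bars = schedule
    z = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        beta, alpha, abar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        mean = (z - beta / math.sqrt(1.0 - abar) * eps_model(z, t)) / math.sqrt(alpha)
        sigma = math.sqrt(beta) if t > 1 else 0.0  # assumed: no noise on the last step
        z = mean + sigma * rng.gauss(0.0, 1.0)
    return z

# Toy setting (assumption): all the data sits at Z0, so the optimal noise
# regressor can be written down instead of being learned.
Z0 = 0.7
SCHEDULE = make_schedule(T=50)

def oracle_eps(z_t, t):
    abar = SCHEDULE[2][t - 1]
    return (z_t - math.sqrt(abar) * Z0) / math.sqrt(1.0 - abar)
```

In this toy setting, `simplified_loss(oracle_eps, Z0, SCHEDULE[2], rng)` evaluates to zero up to floating-point error, and `denoise(oracle_eps, SCHEDULE, rng)` recovers `Z0`. In practice `oracle_eps` would be replaced by a learned regressor, and the scalars by action (or observation-action) vectors.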

### Flow Matching

The posterior parametrization adopted by DMs has proved effective, yet it raises concerns about efficiency at inference time, where a possibly large number of compute-expensive denoising steps is needed to recover a sample from the target distribution.
Flow Matching (FM) [@lipmanFlowMatchingGenerative2023] extends DMs to the general case of arbitrary, parametrized likelihoods and posteriors, thereby defining a broader class of GMs that provides a unified framework for learning *continuous transformations* between distributions, encompassing and generalizing DMs.
Instead of a *stochastic, discrete, multi-step* denoising process, FM aims to learn a *deterministic, continuous, differentiable flow* $\psi: [0,1] \times Z \mapsto Z$, formalized starting from a possibly time-dependent vector field $v: [0,1] \times Z \mapsto Z$ transporting samples from a simple prior distribution $p_0$---e.g., a standard Gaussian---to a more complex, potentially unknown data distribution $p_1$ over time.
Note how FM models time $t \in [0,1]$ to be varying continuously while moving away *from* an easy-to-sample distribution $p_0$ *towards* the unknown data-distribution, $p_1$.
This results in a continuous and deterministic trajectory for each sample, which can be more efficient to generate compared to the stochastic paths of DMs.
Formally, FM can be fully characterized by an ordinary differential equation (ODE) relating instantaneous variations of flows with the underlying vector field, and hence providing complete trajectories over the distributions' support when integrating over time,

$$\frac\{d\}\{dt\} \psi(t, z) = v(t, \psi(t, z)), \qquad \psi(0, z) = z.$$

FM proved very effective in a variety of applications, ranging from image [@esserScalingRectifiedFlow2024] and video generation [@polyakMovieGenCast2025] to robotics control [@black $p_0$ VisionLanguageActionFlow2024].
Most notably, in their introductory work on FM for GM, @lipmanFlowMatchingGenerative2023 show how DMs can be seen as a specific instance of FM where the *conditional* target vector field $u$ approximated by the noise regressor corresponds to

$$
    u(t, z\vert z_0) = \frac\{\frac\{d\}\{dt\}\alpha(1-t)\}\{1 - (\alpha(1-t))^2\}(\alpha(1-t)z - z_0), \quad \alpha(t) = e^\{-\frac12 \int_0^t \beta(s) ds\}, \quad \forall z_0 \in \mathcal D.
$$

Note that the traditional discrete-time noise schedule $\{\beta_t\}_\{t=0\}^T$ is now generalized to a continuous map $\beta : [0,1] \mapsto \mathbb R^+$.
Crucially, @lipmanFlowMatchingGenerative2023 prove that by exclusively optimizing the vector field for individual data points $z_0 \in \mathcal D$, one also retrieves the optimal flow to morph the entire support of the initial distribution $p_0$ into $p_1 \ \text\{s.t.\} \ \mathcal D \sim p_1$.

<ResponsiveImage src={Ch4NormalizingFlows} alt="Probability distributions can be modified applying vector fields resulting in a flow of mass in the support. When acting over time, vector fields can effectively change the distribution's structure." id="fig-fig:ch4-normalizing-flows" />

*Probability distributions can be modified applying vector fields resulting in a flow of mass in the support. When acting over time, vector fields can effectively change the distribution's structure.*

While the noising schedule of DMs results in a stochastic process that resembles a random walk, FM allows for more general---potentially, deterministic---likelihood and posterior parametrization.
In the FM literature, the likelihood and posterior probability densities defined along a HMLV model are typically jointly referred to as a *probability path*, where the distributions for successive adjacent transitions in the HMLV model are related by the (normalized) flow between them (Figure fig:ch4-normalizing-flows).
The inherent flexibility of FM is one of its key advantages over DMs, as it opens up the possibility of *learning* more efficient paths.
For instance, one can design probability paths inspired by Optimal Transport (OT)---a subdiscipline studying the problem of finding the most efficient way to morph one probability distribution into another.
Probability paths obtained through OT tend to be *straighter* than diffusion paths (Figure fig:ch4-diffusion-paths-versus-fm), which can lead to faster and more stable training, as well as higher-quality sample generation with fewer steps at inference time.
By avoiding unnecessary backtracking associated with the inherent stochastic nature of both the noising and denoising process in DMs, test-time compute is typically significantly reduced, while retaining comparable results [@lipmanFlowMatchingGenerative2023].

<ResponsiveImage src={Ch4DiffusionVsFlowmatching} alt="Compared to diffusion, flow matching distorts distributions along a less random pattern, resulting in a clearer interpolation between source and target distribution. The visualization shows an example comparison between these two methods on a joint distribution of robot observations and actions over $T=50$ steps." id="fig-fig:ch4-diffusion-paths-versus-fm" />

*Compared to diffusion, flow matching distorts distributions along a less random pattern, resulting in a clearer interpolation between source and target distribution. The visualization shows an example comparison between these two methods on a joint distribution of robot observations and actions over $T=50$ steps.*

In practice, FM can be applied to generative modeling by learning a vector field regressor $v_\theta(z, t)$ to approximate a given target vector field $u(t, z)$.
In the particular case of DMs, $u(t, z)$ is defined by the conditional vector field above, while in principle the target vector field can be learned to induce a particular transportation, or fixed according to OT.
Given a sample from the data distribution $z_1 \sim p_1$ and a sample from an easy-to-sample prior $z_0 \sim p_0$, Conditional Flow Matching (CFM) defines a simple path between them using *linear interpolation* between samples, $z_t = (1-t)z_0 + t z_1$, resulting in the target vector field $u(t, z_t) = z_1 - z_0$.
Then, an FM model can be trained with the simple regression objective defined as

$$
    \mathcal L(\theta) = \mathbb\{E\}_\{t, z_0, z_1\} \big[
        \Vert v_\theta((1-t)z_0 + t z_1, t) - (z_1 - z_0) \Vert^2 \big], \quad t \sim \mathcal\{U\}([0,1]),$$

where $z_0 \sim p_0(\bullet)$ and $z_1 \sim p_1(\bullet)$. Note how in this objective---differently from the simplified diffusion objective---time is assumed to be varying continuously $t \sim \mathcal U([0,1])$ rather than discretely $t \sim \mathcal U(\\{1,\dots,T\\})$, a key property of flow-based models.
This objective directly regresses the learned vector field onto the simple, straight path connecting a point from the prior and a point from the data, providing a simulation-free training procedure that is both stable and efficient.
At inference time, samples are generated by starting with $z_0 \sim p_0$ and iteratively refining them according to $\frac\{dz_t\}\{dt\} = v_\theta(z_t, t)$ for $t \in [0,1]$---an operation that can be numerically carried out with standard ODE solvers.
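The regression objective and the ODE-based sampling procedure above can be sketched in a few lines. This is a hedged, scalar illustration: the forward-Euler integrator is only one of many possible ODE solvers, the function names are hypothetical, and the data distribution is assumed to be a point mass at `Z1`, in which case the optimal vector field for the linear path is available in closed form and no training is needed.

```python
import random

def cfm_loss(v_model, z0, z1, t):
    """Squared error between the regressor at z_t and the target velocity z1 - z0."""
    z_t = (1.0 - t) * z0 + t * z1  # linear interpolation between prior and data sample
    return (v_model(z_t, t) - (z1 - z0)) ** 2

def euler_sample(v_model, z0, num_steps=10):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with forward Euler (assumed solver)."""
    z, dt = z0, 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * v_model(z, k * dt)  # v is only evaluated at t < 1
    return z

# Toy setting (assumption): all the data sits at Z1, so the vector field
# transporting any z at time t onto Z1 along a straight line is known.
Z1 = 0.7

def oracle_v(z, t):
    return (Z1 - z) / (1.0 - t)

rng = random.Random(0)
z0 = rng.gauss(0.0, 1.0)               # sample from the prior p_0
sample = euler_sample(oracle_v, z0)    # lands on Z1 up to floating-point error
```

Because the target paths are straight, even a coarse solver with few steps transports the prior sample onto the data in this example. With a learned `v_model` and a non-degenerate data distribution, the number of integration steps becomes a trade-off between compute and sample quality.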

## Action Chunking with Transformers

While GMs prove useful in learning complex, high-dimensional multi-modal distributions, they do not natively address the compounding errors problem characteristic of online, sequential predictions.
In Action Chunking with Transformers (ACT), @zhaoLearningFineGrainedBimanual2023 present an application of VAEs to the problem of learning purely from offline trajectories, and introduce a simple yet effective method to mitigate error compounding, learning high-fidelity autonomous behaviors.
Drawing inspiration from how humans plan and execute sequences of actions $a_\{t:t+k\}$ as atomic units, rather than single actions $a_t$, @zhaoLearningFineGrainedBimanual2023 propose learning a GM on a dataset of input demonstrations by modeling *action chunks*.
Besides contributions to learning high-performance autonomous behaviors, @zhaoLearningFineGrainedBimanual2023 also introduce hardware contributions in the form of a low-cost bimanual robot setup (ALOHA) capable of performing fine-grained manipulation tasks, such as opening a lid, slotting a battery into its slot, or even preparing tape for application.

On the robot learning side of their contributions, @zhaoLearningFineGrainedBimanual2023 adopt transformers as the architectural backbone to learn a *Conditional* VAE [@sohnLearningStructuredOutput2015].
Conditional VAEs are a variation of the more standard VAE formulation introducing a conditioning variable on sampling from the latent prior, allowing the modeling of *one-to-many* relationships between latent and data samples.
Further, in stark contrast with previous work [@florenceImplicitBehavioralCloning2022; @jannerPlanningDiffusionFlexible2022], @zhaoLearningFineGrainedBimanual2023 do not learn a full joint $p_\theta(o,a)$ on observations and actions.
While the *policy* distribution $p_\theta(a \vert o)$ can in principle be entirely derived from its joint $p_\theta(o,a)$, doing so is often impractical when using function approximators: $p_\theta(a \vert o) = \tfrac\{p_\theta(o,a)\}\{\int_\{\mathcal\{A\}\} p_\theta(o,a) \, da\}$, and the integral in the denominator is typically intractable.
Instead of modeling the full joint using a vanilla VAE, @zhaoLearningFineGrainedBimanual2023 propose learning a *conditional* VAE [@sohnLearningStructuredOutput2015], directly modeling the policy distribution $p(a \vert o)$.

In practice, when learning from demonstrations, adopting CVAEs results in a slight modification to the VAE objective introduced earlier, which is adapted to

$$
    \text\{ELBO\}_\{\mathcal D\}(\theta, \phi, \omega) = \sum_\{i=0\}^\{N\} \left(
            \mathbb\{E\}_\{z \sim q_\phi(\cdot \vert o_i, a_i)\} \big[ \log p_\theta(a_i \vert z, o_i) \big]
        - \text\{D\}_\{\text\{KL\}\} \big[ q_\phi(z \vert o_i, a_i) \Vert p_\omega(z \vert o_i) \big]
        \right)$$

Notice how in this objective we are now also learning a new set of parameters $\omega$ for the prior distribution in the latent space.
Effectively, this enables conditioning latent-space sampling (and thus reconstruction) during training, and potentially inference, proving useful when learning inherently conditional distributions like policies.
Further, ACT is trained as a $\beta$-CVAE [@higgins2017beta], using the weight $\beta$ of the KL regularization term in the objective above as a hyperparameter regulating the information condensed in the latent space, where a higher $\beta$ results in a less expressive latent space.
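As an illustration of the $\beta$-weighted objective, the sketch below computes a single-sample (negative) ELBO for diagonal Gaussian posterior and prior, for which the KL term has a closed form. It is a minimal sketch under assumptions that are ours and not stated in the text: the reconstruction term is taken to be an L1 distance between the demonstrated and decoded action chunk, both distributions are diagonal Gaussians, and every name is hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2))], summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

def beta_cvae_loss(action, reconstruction, posterior, prior, beta=1.0):
    """Negative single-sample ELBO estimate with a beta-weighted KL term.

    `posterior` and `prior` are (mean, std) pairs for q(z | o, a) and p(z | o).
    The L1 reconstruction term is an assumed stand-in for -log p(a | z, o).
    """
    recon = sum(abs(a - r) for a, r in zip(action, reconstruction))
    kl = kl_diag_gaussians(*posterior, *prior)
    return recon + beta * kl

# Example: a 2-dimensional action, a 1-dimensional latent, and a standard Gaussian prior.
loss = beta_cvae_loss(
    action=[1.0, 2.0],
    reconstruction=[1.5, 2.0],
    posterior=([1.0], [1.0]),
    prior=([0.0], [1.0]),
    beta=10.0,
)
```

Increasing `beta` penalizes posteriors that deviate from the prior more heavily, which is how the weight limits the amount of information stored in the latent.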

In their work, @zhaoLearningFineGrainedBimanual2023 ablated using a GM to learn from human demonstrations compared to a simpler, supervised objective, $\mathcal L_1(a,a^\prime) = \Vert a - a^\prime \Vert_1$.
Interestingly, they found the performance of these two approaches to be comparable when learning from *scripted* demonstrations.
That is, when learning from data collected rolling out a predetermined set of commands $[q^c_0, q^c_1, \dots]$, GM did *not* provide an advantage over standard supervised learning.
However, when learning from human demonstrations---i.e., from data collected executing commands coming from a human controller $[q^h_0, q^h_1, \dots]$---they found performance (success rate on a downstream task) to be severely hindered (-33.3%) when using the supervised objective in place of the GM.
The authors also ablate the action chunking paradigm, reporting significant performance gains from performing action chunking.
To avoid acting open-loop, @zhaoLearningFineGrainedBimanual2023 design an inference process that performs inference at every timestep $t$ and then aggregates overlapping chunks using an exponential moving average.
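The aggregation of overlapping chunks can be sketched as follows. This is a simplified illustration with scalar actions. The exponential weighting $w_i = e^{-m i}$, with $i = 0$ denoting the oldest prediction for a given timestep, is an assumption of this example, as are the class and function names.

```python
import math

def ensemble(predictions, m=0.1):
    """Exponentially weighted average of the predictions for one timestep, oldest first."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

class TemporalEnsembler:
    """Stores every chunk predicted so far and aggregates overlapping actions."""

    def __init__(self, m=0.1):
        self.m = m
        self.buffer = {}  # timestep -> predictions for that timestep, in order of arrival

    def add_chunk(self, t, chunk):
        """Register a chunk predicted at timestep t, covering t, t+1, ..."""
        for offset, action in enumerate(chunk):
            self.buffer.setdefault(t + offset, []).append(action)

    def action_at(self, t):
        """Aggregate (and discard) all predictions available for timestep t."""
        return ensemble(self.buffer.pop(t), self.m)

# Two overlapping chunks of length 3, predicted at consecutive timesteps.
ens = TemporalEnsembler(m=0.0)      # m = 0 reduces to a plain average
ens.add_chunk(0, [0.0, 1.0, 2.0])
ens.add_chunk(1, [3.0, 4.0, 5.0])
first = ens.action_at(0)            # only one prediction covers t = 0
second = ens.action_at(1)           # average of 1.0 and 3.0
```

With `m > 0`, older predictions receive a larger weight than newer ones, so `m` controls how quickly new observations are incorporated into the executed action.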

<ResponsiveImage src={Ch4Act} alt="Action Chunking with Transformer (ACT), as in [@zhaoLearningFineGrainedBimanual2023]. ACT introduces an action chunking paradigm to cope with high-dimensional multi-modal demonstration data, and a transformer-based CVAE architecture." id="fig-fig:ch4-act" />

*Action Chunking with Transformer (ACT), as in [@zhaoLearningFineGrainedBimanual2023]. ACT introduces an action chunking paradigm to cope with high-dimensional multi-modal demonstration data, and a transformer-based CVAE architecture.*

In ACT (Figure fig:ch4-act), inference for a given observation $o \in \mathcal O$ could be performed by (1) computing a prior $p_\omega(z \vert o)$ for the latent and (2) decoding an action chunk from a sampled latent $z \sim p_\omega(\bullet \vert o)$, similarly to how standard VAEs generate samples, with the exception that vanilla VAEs typically set $p(z\vert o) \equiv p(z) \sim N(\mathbf\{0\}, \mathbf\{I\})$ and thus skip (1).

<ResponsiveImage src={Ch4ActEncoder} alt="The CVAE encoder used in ACT. Input action chunks are first embedded and aggregated with positional embeddings, before being processed alongside embedded proprioceptive information and a learned `[CLS]` token used to aggregate input-level information and predict the style variable $z$. The encoder is entirely disregarded at inference time." id="fig-fig:ch4-act-encoder" />

*The CVAE encoder used in ACT. Input action chunks are first embedded and aggregated with positional embeddings, before being processed alongside embedded proprioceptive information and a learned `[CLS]` token used to aggregate input-level information and predict the style variable $z$. The encoder is entirely disregarded at inference time.*

However, the authors claim that using a deterministic procedure to derive $z$ may benefit policy evaluation, and thus avoid sampling from the conditional prior altogether.
At test time, they simply use $z = \mathbf\{0\}$, the mean of the standard Gaussian that the conditional prior on $z$ is set to during training.
Conditioning on the observation $o$ is instead achieved by explicitly feeding proprioceptive and visual observations to the decoder, $p_\theta(a \vert z, o)$.
During training, $z$ is sampled from the approximate posterior distribution $q_\phi(z \vert o, a)$, which, however, disregards image observations and exclusively uses proprioceptive states to form $o$ for efficiency reasons (the posterior $q_\phi$ is discarded entirely at test time).

<ResponsiveImage src={Ch4ActDecoder} alt="The CVAE decoder used in ACT, comprising a full encoder-decoder Transformer architecture. Camera observations from all $n$ camera views are first embedded using pre-trained visual encoders, and then concatenated to the corresponding positional embeddings. Then, alongside embeddings for the available proprioceptive information and the style variable $z$ retrieved from the CVAE encoder, the Transformer encoder shares the matrices $K,Q$ with the Transformer decoder, trained to decode fixed position embeddings into valid action chunks." id="fig-fig:ch4-act-decoder" />

*The CVAE decoder used in ACT, comprising a full encoder-decoder Transformer architecture. Camera observations from all $n$ camera views are first embedded using pre-trained visual encoders, and then concatenated to the corresponding positional embeddings. Then, alongside embeddings for the available proprioceptive information and the style variable $z$ retrieved from the CVAE encoder, the Transformer encoder shares the matrices $K,Q$ with the Transformer decoder, trained to decode fixed position embeddings into valid action chunks.*

### Code Example: Learning ACT

## Diffusion Policy

DMs proved very effective in approximating complex, high-dimensional distributions, such as distributions over images [@hoDenoisingDiffusionProbabilistic2020] or videos [@polyakMovieGenCast2025], thanks to their inherent capability to deal with multimodal data and their training stability.
In Diffusion Policy (DP), @chiDiffusionPolicyVisuomotor2024 present an application of DMs to the field of robot learning, leveraging diffusion to model human expert demonstrations in a variety of simulated and real-world tasks.
Similarly to Action Chunking with Transformers [@zhaoLearningFineGrainedBimanual2023], @chiDiffusionPolicyVisuomotor2024 (1) adopt a modified *observation-conditioned target distribution* instead of the full joint $p(o,a)$ and (2) predict multiple actions into the future instead of a single action.
Besides the intractability of the observations' marginal $p_\theta(o)$ given $p_\theta(o,a)$, DP's rationale for modeling the data distribution via $p_\theta(a \vert o)$ stems from the compute-intensive nature of diffusion at test time.
Generating actions *alongside* observations would likely increase complexity, and thus the number of denoising operations, to no benefit, since robotics applications rely on the capability to generate controls rather than to reproduce observations.

In practice, conditioning on observation data is achieved by conditioning the noise regressor $\epsilon_\theta$ introduced in the simplified diffusion objective on a stack of $T_o$ observations, resulting in the *conditional* simplified diffusion objective

$$
    \mathcal L(\theta) = \mathbb\{E\}_\{t, a_\{t:t+H_a\}, \epsilon\} \big[
        \Vert \epsilon - \epsilon_\theta(\sqrt\{\bar \alpha_t\} a_\{t:t+H_a\} + \epsilon \sqrt\{1 - \bar \alpha_t\}, t, o_\{t-T_o:t\}) \Vert^2 \big], \quad
        t \sim \mathcal\{U\}(\{1,\dots,T\}), \quad
        a_\{t:t+H_a\}, o_\{t-T_o:t\} \sim \mathcal\{D\}, \quad
        \epsilon \sim \mathcal\{N\}(\mathbf\{0\},\mathbf\{I\}).$$

Notice how in this objective the noise regressor is conditioned both on the latent variable rank $t$ *and* on a stack of previous observations $o_\{t-T_o:t\}$.
@chiDiffusionPolicyVisuomotor2024 claim the combination of (1) conditioning on a horizon of previous observations and (2) predicting multiple actions into the future allows DP to *commit to specific modes* in the data at inference time, which proves essential for good performance and for avoiding indecisiveness.

<ResponsiveImage src={Ch4DiffusionPolicy} alt="The Diffusion Policy architecture, as in [@chiDiffusionPolicyVisuomotor2024]. A stack of $H_o$ previous observations is used as external conditioning to denoise a group of $H_a$ actions. Conditioning is used at every layer of a U-Net block, and in practice allows one to obtain fully-formed action chunks with as few as $T=10$ denoising steps." id="fig-fig:diffusion-policy-architecture" />

*The Diffusion Policy architecture, as in [@chiDiffusionPolicyVisuomotor2024]. A stack of $H_o$ previous observations is used as external conditioning to denoise a group of $H_a$ actions. Conditioning is used at every layer of a U-Net block, and in practice allows one to obtain fully-formed action chunks with as few as $T=10$ denoising steps.*

Figure fig:diffusion-policy-architecture shows the convolution-based version of the architecture proposed by @chiDiffusionPolicyVisuomotor2024, illustrating inference on a single sample from $\mathcal D$ for simplicity.
An arbitrarily noisy chunk of $H_a$ actions $\tilde a_\{t:t+H_a\}$ is mapped to a learned high-dimensional space.
Similarly, both image observations and poses are embedded before being aggregated with the action embeddings.
Then, a U-Net [@ronnebergerUNetConvolutionalNetworks2015] is trained to regress the noise added into $\tilde a_\{t:t+H_a\}$, using observation conditioning information at every layer and seeking to optimize the conditional objective above.
At inference time, the noise predictor is used to predict the noise at every $t \in [T, \dots, 0 ]$ and iteratively subtract it from $\tilde a_\{t:t+H_a\}$, reversing the diffusion process simulated in training, conditioned on $o_\{t-T_o:t\}$, to predict $a_\{t:t+H_a\}$.

Trained on 50-150 demonstrations (15-60 minutes of teleoperation data), DP achieves strong performance on a variety of simulated and real-world tasks, including dexterous and deformable manipulation tasks such as sauce pouring and mat unrolling.
Notably, the authors ablated the relevance of using RGB camera streams as input to their policy, and observed that high frame-rate visual observations can be used to attain performance (measured as success rate) comparable to that of state-based policies, which are typically trained in simulation with privileged information not directly available in real-world deployments.
As high frame-rate RGB inputs naturally accommodate dynamic, fast-changing environments, @chiDiffusionPolicyVisuomotor2024's conclusion offers significant evidence for learning streamlined control policies directly from pixels.
In their work, @chiDiffusionPolicyVisuomotor2024 also compare DP against their baseline as a function of the size of the dataset collected, showing that DP outperforms the considered baseline for every dataset size considered.
Further, to accelerate inference, @chiDiffusionPolicyVisuomotor2024 employ Denoising Diffusion Implicit Models (DDIM) [@songDenoisingDiffusionImplicit2022], a variant of Denoising Diffusion Probabilistic Models (DDPM) [@hoDenoisingDiffusionProbabilistic2020] adopting a strictly deterministic denoising paradigm (differently from DDPM's natively stochastic one), inducing the same final distribution as DDPM while requiring 10 times fewer denoising steps at inference time [@chiDiffusionPolicyVisuomotor2024].
Across a range of simulated and real-world tasks, @chiDiffusionPolicyVisuomotor2024 find DPs particularly performant when implementing a transformer-based network as $\epsilon_\theta$.
However, the authors note the increased sensitivity of transformer networks to hyperparameters, and thus explicitly recommend starting out with a simpler, convolution-based architecture for diffusion (Figure fig:diffusion-policy-architecture).
Convolution-based architectures are, however, reported to be biased towards learning low-frequency components [@tancikFourierFeaturesLet2020], and may thus prove more challenging to train on non-smooth action sequences.
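To illustrate how a deterministic sampler can skip denoising steps, the sketch below implements the noise-free DDIM update on an evenly spaced subset of the training timesteps. It is a scalar sketch and not the implementation used in DP: the linear schedule, the even spacing of the retained timesteps, and all names are assumptions of this example. As a sanity check, it again uses a point-mass data distribution at `C`, for which the optimal noise regressor is known.

```python
import math

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products abar_t for a linear beta schedule (an assumed choice)."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddim_sample(eps_model, alpha_bars, z_T, num_steps):
    """Deterministic DDIM sampling visiting only num_steps of the T training steps."""
    T = len(alpha_bars)
    assert 1 <= num_steps <= T
    # Evenly spaced, 1-indexed timesteps, visited from the noisiest to the cleanest.
    steps = [round(T * (k + 1) / num_steps) for k in range(num_steps)][::-1]
    z = z_T
    for i, t in enumerate(steps):
        abar_t = alpha_bars[t - 1]
        # abar of the next (less noisy) retained step; 1.0 stands for clean data.
        abar_prev = alpha_bars[steps[i + 1] - 1] if i + 1 < len(steps) else 1.0
        eps = eps_model(z, t)
        z0_hat = (z - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)  # predicted clean sample
        z = math.sqrt(abar_prev) * z0_hat + math.sqrt(1.0 - abar_prev) * eps
    return z

# Toy check (assumption): the data is a point mass at C, so the optimal
# noise regressor is available in closed form.
ALPHA_BARS = linear_alpha_bars(100)
C = 0.7
calls = []

def oracle_eps(z, t):
    calls.append(t)  # record every evaluation of the noise regressor
    abar = ALPHA_BARS[t - 1]
    return (z - math.sqrt(abar) * C) / math.sqrt(1.0 - abar)

out = ddim_sample(oracle_eps, ALPHA_BARS, z_T=1.3, num_steps=10)
```

Here the regressor is evaluated 10 times instead of 100, and `out` matches `C` up to floating-point error. With a learned regressor, the quality obtained with fewer steps has to be validated empirically.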

### Code Example: Learning Diffusion Policies

## Optimized Inference

Modern visuomotor policies output *action chunks*, i.e., sequences $\pi(o_t) = \mathbf\{A\}_t$, with $\mathbf\{A\}_t = \bigl(a_t,a_\{t+1\},\dots,a_\{t+H_a-1\}\bigr)$ being a sequence of $H_a \gg 1$ low-level commands enqueued in an action queue and originating from an environment observation, $o_t$.
Predicting series of actions instead of single commands proved essential in learning complex, multi-modal behavior [@zhaoLearningFineGrainedBimanual2023; @chiDiffusionPolicyVisuomotor2024].

Typically, the robot executes the entire action chunk $\mathbf\{A\}_t$ before a new observation $o_\{t+H_a\}$ is passed to the policy $\pi$ to predict the next chunk.
This results in open-loop inference in between observations captured every $H_a$ timesteps.
@zhaoLearningFineGrainedBimanual2023 adopt a different strategy whereby the robot controller interleaves chunk prediction $\mathbf\{A\}_t \gets \pi(o_t)$ and chunk consumption $a_t \gets \text\{PopFront\}(\mathbf\{A\}_t)$, computing a new chunk of actions at every timestep $t$ and aggregating the predicted chunks on overlapping sections.
While adaptive---every observation at every timestep $o_t$ is processed---such approaches rely on running inference continuously, which can be prohibitive in resource-constrained scenarios, such as edge deployments.

A less resource-intensive approach is to entirely exhaust the chunk $\mathbf\{A\}$ before predicting a new chunk of actions, a strategy we refer to as *synchronous* (sync) inference.
Sync inference efficiently allocates computation every $H_a$ timesteps, resulting in a reduced average computational burden at control time.
On the other hand, it inherently hinders the responsiveness of robot systems, introducing blind lags due to the robot being *idle* while computing $\mathbf\{A\}$.

We directly assess the lack of adaptiveness of robot systems due to acting open-loop, and the presence of lags at runtime, by decoupling action chunk prediction $\mathbf\{A\}_t \gets \pi(o_t)$ from action execution $a_t \gets \text\{PopFront\}(\mathbf\{A\}_t)$.
To this end, we develop an *asynchronous* (async) inference stack (Algorithm alg:async-inference), whereby a **RobotClient** sends an observation $o_t$ to a **PolicyServer**, receiving an action chunk $\mathbf\{A\}_t$ once inference is complete (Figure fig:ch4-async-inference).
In this, we avoid execution lags by triggering chunk prediction while the control loop is still consuming a previously available queue, aggregating it with the newly incoming queue whenever available.
In turn, async inference tightens the loop between action prediction and action execution by increasing the frequency at which observations are processed for chunk prediction.
Crucially, decoupling action prediction from action execution also makes it possible to allocate more computational resources on a remote policy server sending actions to the robot client over the network, something which may prove very effective in resource-constrained scenarios such as low-power robots.

<ResponsiveImage src={Ch4AsyncInference} alt="**Asynchronous inference**. Illustration of the asynchronous inference stack. Note that the policy can be run on a remote server, possibly with GPUs." id="fig-fig:ch4-async-inference" />

***Asynchronous inference**. Illustration of the asynchronous inference stack. Note that the policy can be run on a remote server, possibly with GPUs.*

**Algorithm** (alg:async-inference). Asynchronous inference control-loop, **RobotClient** side.

- **Input:** horizon $T$, chunk size $H_a$, threshold $g \in [0,1]$
- **Init:** capture $o_0$; send $o_0$ to **PolicyServer**; receive $\mathbf\{A\}_0 \gets \pi(o_0)$
- **for** $t$ **to** $T$ **do**
  - $a_t \gets \text\{PopFront\}(\mathbf\{A\}_t)$
  - Execute($a_t$) (execute action at step $t$)
  - **if** $\frac\{\vert \mathbf\{A\}_t \vert\}\{H_a\} < g$ **then** (queue below threshold)
    - capture new observation, $o_\{t+1\}$
    - **if** NeedsProcessing($o_\{t+1\}$) **then** (similarity filter, or triggers direct processing)
      - `async_handle` $\gets$ AsyncInfer($o_\{t+1\}$) (trigger new chunk prediction, non-blocking)
      - $\tilde\{\mathbf\{A\}\}_\{t+1\} \gets \pi(o_\{t+1\})$ (new queue is predicted with the policy)
      - $\mathbf\{A\}_\{t+1\} \gets f(\mathbf\{A\}_t, \tilde\{\mathbf\{A\}\}_\{t+1\})$ (aggregate overlaps, if any)
  - **if** NotCompleted(`async_handle`) **then**
    - $\mathbf\{A\}_\{t+1\} \gets \mathbf\{A\}_t$ (no update on queue, inference is not over just yet)

#### Implementation details

*Async* inference (1) tightens the control loop by capturing observations more often, eliminating idle gaps at runtime, and (2) makes it possible to run inference on more powerful computational resources than the ones typically available onboard autonomous robotic platforms.

Algorithmically, we attain (1) on the **RobotClient**-side by consuming actions from a readily available queue until a threshold condition on the number of remaining actions in the queue ($\vert \mathbf\{A\}_t \vert / H_a < g$) is met. When this condition is triggered, a new observation of the environment is captured and sent to the (possibly remote) **PolicyServer**.
To avoid redundant server calls and erratic behavior at runtime, observations are compared in joint-space, and near-duplicates are dropped.
Two observations are considered near-duplicates if their distance in joint-space is under a predetermined threshold, $\epsilon \in \mathbb R_+$.
Importantly, when the queue available to the robot client eventually becomes empty, the most recent observation is processed regardless of similarity.
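The client-side checks described above can be sketched with three small helpers. This is a hedged illustration rather than the actual stack: the function names are hypothetical, the Euclidean distance is an assumed choice for the joint-space metric, and the aggregation function `aggregate` (a weighted average on the overlapping portion, assuming both queues are aligned on the current timestep) is only one possible instance of the unspecified $f$.

```python
import math

def queue_below_threshold(queue, chunk_size, g):
    """True when the fraction of remaining actions drops below the threshold g."""
    return len(queue) / chunk_size < g

def needs_processing(obs, last_sent_obs, queue, eps):
    """Similarity filter: drop near-duplicate observations unless the queue is empty."""
    if last_sent_obs is None or len(queue) == 0:
        return True  # nothing to compare against, or the robot is about to stall
    # Assumed metric: Euclidean distance between joint-space configurations.
    return math.dist(obs, last_sent_obs) >= eps

def aggregate(current_queue, incoming_chunk, new_weight=0.5):
    """Blend overlapping actions, then append the non-overlapping tail.

    Assumes both sequences start at the current timestep; new_weight is a
    hypothetical blending coefficient.
    """
    overlap = min(len(current_queue), len(incoming_chunk))
    merged = [
        (1.0 - new_weight) * old + new_weight * new
        for old, new in zip(current_queue, incoming_chunk)
    ]
    longer = incoming_chunk if len(incoming_chunk) >= len(current_queue) else current_queue
    return merged + list(longer[overlap:])

# Example: two actions left out of ten, with a threshold g = 0.5.
trigger = queue_below_threshold([0.1, 0.2], chunk_size=10, g=0.5)                 # True
duplicate = not needs_processing([0.0, 0.0], [0.0, 0.01], queue=[0.1], eps=0.1)   # True
merged = aggregate([0.0, 0.0], [2.0, 2.0, 2.0])                                   # [1.0, 1.0, 2.0]
```

Setting `new_weight=1.0` makes the incoming chunk overwrite the overlapping actions, while smaller values smooth the transition between consecutive chunks.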

Interestingly, the behavior of async inference can be studied analytically. First, let $\ell$ be a random variable modeling the time needed to receive an action chunk $\mathbf\{A\}$ after sending an observation $o$, i.e., the sum of (1) the time to send the observation $o$ from the **RobotClient** to the **PolicyServer**, $t_\{C \to S\}$, (2) the inference latency on the **PolicyServer**, $\ell_S$, and (3) the time to send $\mathbf\{A\}$ from the **PolicyServer** to the **RobotClient**, $t_\{S \to C\}$. By linearity of expectation, $\mathbb E [\ell] = \mathbb E[t_\{C \to S\}] + \mathbb E[\ell_S] + \mathbb E[t_\{S \to C\}]$, which can be further simplified to $\mathbb E[\ell] \simeq \mathbb E[\ell_S]$, assuming communication time is (1) equal in both directions and (2) negligible with respect to the inference latency. Second, let $\Delta t$ be the environment's control cycle. With a real-world frame-rate of 30 frames per second, $\Delta t \approx 33\text\{ms\}$. Consequently, exhausted queues at runtime, i.e., being idle while awaiting a new chunk, are avoided for $g \geq \frac\{\mathbb E[\ell_S] / \Delta t\}\{H_a\}$. The queue threshold $g$ thus plays a major role in determining the availability of actions to the **RobotClient**.

the referenced figure illustrates how the size of the action chunk $\lvert \mathbf\{A\}_t \rvert $ evolves over time for three representative values of $ g$, detailing the following key scenarios:

- **Sequential limit $(g=0)$.** The client drains the entire chunk before forwarding a new observation to the server. During the round-trip latency needed to compute the next chunk, the queue is empty, leaving the robot *incapable of acting*. This reproduces the behavior of a fully sequential deployment and results in an average of $\mathbb E[\ell_S]$ idle seconds.

- **Asynchronous inference $(g \in (0,1))$.** Allowing the client to consume $1-g$ of its available queue $\mathbf\{A\}_\{t-1\}$ before triggering inference for a new action queue $\mathbf\{A\}_\{t\}$, amortizing computation while keeping the queue from emptying. The overlap between successive chunks provides a buffer against modeling errors without the full cost of the $g=1$ regime. The updated queue $\mathbf\{A\}_t$ is obtained aggregating queues on the overlapping timesteps between $\mathbf\{A\}_\{t-1\}$ and the incoming $\tilde\{\mathbf\{A\}\}_\{t\}$.

- **Compute-intensive limit $(g=1)$.** As an extreme case, and in keeping with \@zhaoLearningFineGrainedBimanual2023, an observation is sent at *every* timestep. The queue is therefore almost always filled, with only a minor saw-tooth due to $\Delta t/\mathbb E[\ell_S] < 1$. While maximally reactive, this setting incurs one forward pass per control tick and can prove prohibitively expensive on limited hardware. Importantly, because the client keeps consuming actions while the server computes the next chunk, the queue never returns to its full size.
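
The three regimes above can be reproduced with a toy discrete-time simulation. This is a minimal sketch under simplifying assumptions that are not taken from the text: the latency is a fixed number of control ticks, at most one inference request is in flight at a time (so $g=1$ re-triggers as soon as a chunk arrives rather than at every tick), and an incoming chunk replaces the queue after discarding the actions already executed since the observation was sent. All names are hypothetical.

```python
def simulate_queue(g: float, chunk_size: int, latency_ticks: int, n_ticks: int):
    """Return (queue size after each tick, number of idle ticks)."""
    queue = chunk_size            # start from a full chunk
    arrival_tick = None           # tick at which the in-flight chunk arrives
    consumed_since_send = 0       # actions executed while waiting for it
    sizes, idle_ticks = [], 0

    for t in range(n_ticks):
        # 1) Receive the chunk: actions overlapping already-executed steps are dropped.
        if arrival_tick is not None and t >= arrival_tick:
            queue = chunk_size - consumed_since_send
            arrival_tick = None
        # 2) Trigger inference once the queue fraction is at or below g.
        if arrival_tick is None and queue <= g * chunk_size:
            arrival_tick = t + latency_ticks
            consumed_since_send = 0
        # 3) Execute one action, or idle if the queue is exhausted.
        if queue > 0:
            queue -= 1
            if arrival_tick is not None:
                consumed_since_send += 1
        else:
            idle_ticks += 1
        sizes.append(queue)

    return sizes, idle_ticks


H_A, LATENCY = 10, 3  # hypothetical chunk size and latency, in control ticks
results = {g: simulate_queue(g, H_A, LATENCY, n_ticks=100) for g in (0.0, 0.5, 1.0)}
for g, (sizes, idle) in results.items():
    print(f"g={g}: idle ticks={idle}, min queue={min(sizes)}, max queue={max(sizes)}")
```

Under these assumptions, only $g=0$ produces idle ticks, while for $g=1$ the queue oscillates below its full size, in line with the qualitative description in the list above.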

<ResponsiveImage src={Ch4Queues} alt="Action queue size evolution at runtime for various levels of $g$ when (A) not filtering out observations based on joint-space similarity and (B) filtering out near-duplicate observations, measuring their similarity in joint-space." id="fig-fig:ch4-queues" />

*Action queue size evolution at runtime for various levels of $g$ when (A) not filtering out observations based on joint-space similarity and (B) filtering out near-duplicate observations, measuring their similarity in joint-space.*

The referenced figure emphasizes the trade-off governed by $g$: small values result in idle periods, whereas $g\approx 1$ assumes a highly accurate model and pays a significant compute price. In practice, choosing $g\in(0,1)$ strikes a balance between reactivity and resource budgets.
Without the aforementioned similarity filter, the **RobotClient** would send observations for processing every $(1 - g) H_a \cdot \Delta t$ seconds, receiving a new chunk of actions every $(1 - g) H_a \cdot \Delta t + \mathbb E[\ell_S]$ seconds, on average.
The observation similarity filter dilates this processing time, and prevents the robot from stalling because its queue is constantly merged with an incoming, nearly identical, action chunk.
In particular, the referenced figure shows a queue that is continuously refilled with incoming actions *unless* near-duplicate observations are filtered out of the processing pipeline. For clarity, the red arrow in the referenced figure highlights a timestep where the observation similarity mechanism is bypassed, forcing a (nearly identical) observation to be processed because the queue is empty.
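
The filtering logic described above, including the bypass when the queue runs empty, can be sketched as follows. This is an illustrative sketch rather than the actual implementation: the function name, the use of a Euclidean distance between joint positions, and the threshold `epsilon` are all assumptions.

```python
import math
from typing import Optional, Sequence


def should_send_observation(
    joints: Sequence[float],
    last_sent_joints: Optional[Sequence[float]],
    queue_size: int,
    epsilon: float = 1e-2,  # hypothetical joint-space similarity threshold
) -> bool:
    """Decide whether an observation is worth sending for inference."""
    # Bypass: an empty queue must be refilled, even from a near-duplicate observation.
    if queue_size == 0:
        return True
    # Nothing was sent yet, so there is nothing to compare against.
    if last_sent_joints is None:
        return True
    # Filter out observations that barely differ from the last one processed.
    return math.dist(joints, last_sent_joints) > epsilon


# The robot has not moved and still has actions queued: skip this observation.
print(should_send_observation([0.0, 0.5], [0.0, 0.5], queue_size=5))
# Same near-duplicate observation, but the queue is empty: the filter is bypassed.
print(should_send_observation([0.0, 0.5], [0.0, 0.5], queue_size=0))
```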

### Code Example: Using Async Inference

[^1]: Throughout, we adopt the terminology and notation for SL introduced in @shalev-shwartzUnderstandingMachineLearning2014.

[^2]: $o,a = z_0$ for the sake of notation. Steps omitted for brevity. See Section A in @hoDenoisingDiffusionProbabilistic2020 for a complete derivation.