diff --git "a/app/src/content/article.mdx" "b/app/src/content/article.mdx" --- "a/app/src/content/article.mdx" +++ "b/app/src/content/article.mdx" @@ -85,7 +85,7 @@ We sincerely hope this tutorial serves as a valuable starting point for your jou src={ch1_lerobot_figure1} zoomable downloadable - id="fig:figure1" + id="figure1" layout="fixed" alt="lerobot is the open-source library for end-to-end robotics developed by Hugging Face. The library is..." caption={'lerobot is the open-source library for end-to-end robotics developed by Hugging Face. The library is vertically integrated on the entire robotics stack, supporting low-level control of real-world robot devices, advanced data and inference optimizations, as well as SOTA robot learning methods with simple implementations in pure Pytorch.'} @@ -101,13 +101,13 @@ Robotics is, at its core, an inherently multidisciplinary field, requiring a wid This tutorial serves the double purpose of providing useful references for the Science behind--and practical use of--common robot learning techniques. To this aim, we strike to provide a rigorous yet concise overview of the core concepts behind the techniques presented, paired with practical examples of how to use such techniques concretely, with code examples in `lerobot`, for researchers and practitioners interested in the field of robot learning. This tutorial is structured as follows: -- Section 2 reviews classical robotics foundations, introducing the limitations of dynamics-based approaches to robotics. +- Section 2 reviews classical robotics foundations, introducing the limitations of dynamics-based approaches to robotics. -- Section 3 elaborates on the limitations of dynamics-based methods, and introduce RL as a practical approach to solve robotics problems, considering its upsides and potential limitations. 
+- Section 3 elaborates on the limitations of dynamics-based methods, and introduce RL as a practical approach to solve robotics problems, considering its upsides and potential limitations. -- Section 4 further describes robot learning techniques that aim at solving single-tasks learning, leveraging BC techniques to autonomously reproduce specific expert demonstrations. +- Section 4 further describes robot learning techniques that aim at solving single-tasks learning, leveraging BC techniques to autonomously reproduce specific expert demonstrations. -- Section 5 presents recent contributions on developing generalist models for robotics applications, by learning from large corpora of multi-task  multi-robot data (*robotics foundation models*). +- Section 5 presents recent contributions on developing generalist models for robotics applications, by learning from large corpora of multi-task  multi-robot data (*robotics foundation models*). Our goal with this tutorial is to provide an intuitive explanation of the reasons various disparate ideas from Machine Learning (ML) have converged and are powering the current evolution of Robotics, driving the unprecedented progress we see today. We complement our presentation of the most common and recent approaches in robot learning with practical code implementations using `lerobot`, and start here by presenting the dataset format introduced with `lerobot`. @@ -226,7 +226,7 @@ TL;DR Learning-based approaches to robotics are motivated by the need to (1) gen src={ch2_approaches} zoomable downloadable - id="fig:generating-motion-atlas" + id="generating-motion-atlas" layout="fixed" alt="Overview of methods to generate motion (clearly non-exhausitve, see @bekrisStateRobotMotion2024). Th..." caption={'Overview of methods to generate motion (clearly non-exhausitve, see @bekrisStateRobotMotion2024). 
The different methods can be grouped based on whether they explicitly (dynamics-based) or implicitly (learning-based) model robot-environment interactions.'} @@ -234,7 +234,7 @@ TL;DR Learning-based approaches to robotics are motivated by the need to (1) gen Robotics is concerned with producing artificial motion in the physical world in useful, reliable and safe fashion. Thus, robotics is an inherently multi-disciplinar domain: producing autonomous motion in the physical world requires, to the very least, interfacing different software (motion planners) and hardware (motion executioners) components. Further, knowledge of mechanical, electrical, and software engineering, as well as rigid-body mechanics and control theory have therefore proven quintessential in robotics since the field first developed in the 1950s. More recently, Machine Learning (ML) has also proved effective in robotics, complementing these more traditional disciplines @connellRobotLearning1993. As a direct consequence of its multi-disciplinar nature, robotics has developed as a rather wide array of methods, all concerned with the main purpose of producing artificial motion in the physical world. -Methods to produce robotics motion range from traditional *explicit* models--dynamics-based[^1] methods, leveraging precise descriptions of the mechanics of robots’ rigid bodies and their interactions with eventual obstacles in the environment--to *implicit* models--learning-based methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings @agrawalComputationalSensorimotorLearning, @bekrisStateRobotMotion2024. A variety of methods have been developed between these two extrema. 
For instance,  @hansenTemporalDifferenceLearning2022 show how learning-based systems can benefit from information on the physics of problems, complementing a traditional learning method such as Temporal Difference (TD)-learning @suttonReinforcementLearningIntroduction2018 with Model-Predictive Control (MPC). Conversely, as explicit models may be relying on assumptions proving overly simplistic--or even unrealistic--in practice, learning can prove effective to improve modeling of complex phenomena or complement perception @mccormacSemanticFusionDense3D2016. Such examples aim at demonstrating the richness of approaches to robotics, and Figure 2 graphically illustrates some of the most relevant techniques. Such a list is clearly far from being exhaustive, and we refer to @bekrisStateRobotMotion2024 for a more comprehensive overview of both general and application-specific methods for motion generation. In this section, we wish to introduce the inherent benefits of learning-based approaches to robotics--the core focus on this tutorial. +Methods to produce robotics motion range from traditional *explicit* models--dynamics-based[^1] methods, leveraging precise descriptions of the mechanics of robots’ rigid bodies and their interactions with eventual obstacles in the environment--to *implicit* models--learning-based methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings @agrawalComputationalSensorimotorLearning, @bekrisStateRobotMotion2024. A variety of methods have been developed between these two extrema. For instance,  @hansenTemporalDifferenceLearning2022 show how learning-based systems can benefit from information on the physics of problems, complementing a traditional learning method such as Temporal Difference (TD)-learning @suttonReinforcementLearningIntroduction2018 with Model-Predictive Control (MPC). 
Conversely, as explicit models may be relying on assumptions proving overly simplistic--or even unrealistic--in practice, learning can prove effective to improve modeling of complex phenomena or complement perception @mccormacSemanticFusionDense3D2016. Such examples aim at demonstrating the richness of approaches to robotics, and Figure 2 graphically illustrates some of the most relevant techniques. Such a list is clearly far from being exhaustive, and we refer to @bekrisStateRobotMotion2024 for a more comprehensive overview of both general and application-specific methods for motion generation. In this section, we wish to introduce the inherent benefits of learning-based approaches to robotics--the core focus on this tutorial. ### Different Types of Motion @@ -242,13 +242,13 @@ Methods to produce robotics motion range from traditional *explicit* models-- -In the vast majority of instances, robotics deals with producing motion via actuating joints connecting nearly entirely-rigid links. A key distinction between focus areas in robotics is based on whether the generated motion modifies (1) the absolute state of the environment (via dexterity), (2) the relative state of the robot with respect to its environment (exercising mobility skills), or (3) a combination of the two (Figure 3). +In the vast majority of instances, robotics deals with producing motion via actuating joints connecting nearly entirely-rigid links. A key distinction between focus areas in robotics is based on whether the generated motion modifies (1) the absolute state of the environment (via dexterity), (2) the relative state of the robot with respect to its environment (exercising mobility skills), or (3) a combination of the two (Figure 3). Effects such as (1) are typically achieved *through* the robot, i.e. generating motion to perform an action inducing a desirable modification, effectively *manipulating* the environment (manipulation). 
Motions like (2) may result in changes in the robot’s physical location within its environment. Generally, modifications to a robot’s location within its environment may be considered instances of the general *locomotion* problem, further specified as *wheeled* or *legged* locomotion based on whenever a robot makes use of wheels or leg(s) to move in the environment. Lastly, an increased level of dynamism in the robot-environment interactions can be obtained combining (1) and (2), thus designing systems capable to interact with *and* move within their environment. This category is problems is typically termed *mobile manipulation*, and is characterized by a typically much larger set of control variables compared to either locomotion or manipulation alone. @@ -258,13 +258,13 @@ The traditional body of work developed since the very inception of robotics is i Robot manipulators typically consist of a series of links and joints, articulated in a chain finally connected to an *end-effector*. Actuated joints are considered responsible for generating motion of the links, while the end effector is instead used to perform specific actions at the target location (e.g., grasping/releasing objects via closing/opening a gripper end-effector, using a specialized tool like a screwdriver, etc.). -Recently, the development of low-cost manipulators like the ALOHA @zhaoLearningFineGrainedBimanual2023 ALOHA-2 @aldacoALOHA2Enhanced and SO-100/SO-101 @knightStandardOpenSO100 platforms significantly lowered the barrier to entry to robotics, considering the increased accessibility of these robots compared to more traditional platforms like the Franka Emika Panda arm (Figure 4). 
+Recently, the development of low-cost manipulators like the ALOHA @zhaoLearningFineGrainedBimanual2023 ALOHA-2 @aldacoALOHA2Enhanced and SO-100/SO-101 @knightStandardOpenSO100 platforms significantly lowered the barrier to entry to robotics, considering the increased accessibility of these robots compared to more traditional platforms like the Franka Emika Panda arm (Figure 4). -Consider the (simple) case where a SO-100 is restrained from actuating (1) the shoulder pane and (2) the wrist flex and roll motors. This effectively reduces the degrees of freedom of the SO-100 from the original 5+1 (5 joints + 1 gripper) to 2+1 (shoulder lift, elbow flex + gripper). As the end-effector does not impact motion in this model, the SO-100 is effectively reduced to the planar manipulator robot presented in Figure 5, where spheres represent actuators, and solid lines indicate length-$l$ links from the base of the SO-100 to the end-effector (*ee*). +Consider the (simple) case where a SO-100 is restrained from actuating (1) the shoulder pane and (2) the wrist flex and roll motors. This effectively reduces the degrees of freedom of the SO-100 from the original 5+1 (5 joints + 1 gripper) to 2+1 (shoulder lift, elbow flex + gripper). As the end-effector does not impact motion in this model, the SO-100 is effectively reduced to the planar manipulator robot presented in Figure 5, where spheres represent actuators, and solid lines indicate length-$l$ links from the base of the SO-100 to the end-effector (*ee*). Further, let us make the simplifying assumption that actuators can produce rotations up to $2 \pi$ radians. In practice, this is seldom the case due to movement obstructions caused by the robot body itself (for instance, the shoulder lift cannot produce counter-clockwise movement due to the presence of the robot’s base used to secure the SO-100 to its support and host the robot bus), but we will introduce movement obstruction at a later stage. 
-All these simplifying assumptions leave us with the planar manipulator of Figure 6, free of moving its end-effector by controlling the angles $\theta_1$ and $\theta_2$, jointly referred to as the robot’s *configuration*, and indicated with $q = [\theta_1, \theta_2 ] \in [-\pi, +\pi]^2$. The axis attached to the joints indicate the associated reference frame, whereas circular arrows indicate the maximal feasible rotation allowed at each joint. In this tutorial, we do not cover topics related to spatial algebra, and we instead refer the reader to and for excellent explanations of the mechanics and theoretical foundations of producing motion on rigid bodies. +All these simplifying assumptions leave us with the planar manipulator of Figure 6, free of moving its end-effector by controlling the angles $\theta_1$ and $\theta_2$, jointly referred to as the robot’s *configuration*, and indicated with $q = [\theta_1, \theta_2 ] \in [-\pi, +\pi]^2$. The axis attached to the joints indicate the associated reference frame, whereas circular arrows indicate the maximal feasible rotation allowed at each joint. In this tutorial, we do not cover topics related to spatial algebra, and we instead refer the reader to and for excellent explanations of the mechanics and theoretical foundations of producing motion on rigid bodies. -
+
Planar, 2-dof schematic representation of the SO-100 manipulator under diverse deployment settings. From left to right: completely free of moving; constrained by the presence of the surface; constrained by the surface and presence of obstacles. Circular arrows around each joint indicate the maximal rotation feasible at that joint.
-Considering the (toy) example presented in Figure 6, then we can analytically write the end-effector’s position $p \in \mathbb R^2$ as a function of the robot’s configuration, $p = p(q), p: \mathcal Q \mapsto \mathbb R^2$. In particular, we have: +Considering the (toy) example presented in Figure 6, then we can analytically write the end-effector’s position $p \in \mathbb R^2$ as a function of the robot’s configuration, $p = p(q), p: \mathcal Q \mapsto \mathbb R^2$. In particular, we have: $$ `p(q) = \begin{pmatrix} p_x(\theta_1, \theta_2)\\ p_y(\theta_1, \theta_2) \end{pmatrix} = \begin{pmatrix} l \cos(\theta_1) + l \cos(\theta_1 + \theta_2)\\ l \sin(\theta_1) + l \sin(\theta_1 + \theta_2) \end{pmatrix} \in S^{n=2}_{l_1+l_2} = \{ p(q) \in \mathbb R^2: \Vert p(q) \Vert_2^2 \leq (2l)^2, \ \forall q \in \mathcal Q \}` @@ -331,17 +331,17 @@ Deriving the end-effector’s *pose*--position *and* orientation--in some $m$-di In the simplified case here considered (for which $\boldsymbol{p} \equiv p$, as the orientation of the end-effector is disregarded for simplicity), one can solve the problem of controlling the end-effector’s location to reach a goal position $p^*$ by solving analytically for $q: p(q) = f_{\text{FK}}(q) = p^*$. However, in the general case, one might not be able to solve this problem analytically, and can typically resort to iterative optimization methods comparing candidate solutions using a loss function (in the simplest case, $\Vert p(q) - p^* \Vert_2^2$ is a natural candidate), yielding: -$\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2 \, . $ +$\htmlId{ik_problem}{\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2 \, .}$ Exact analytical solutions to IK are even less appealing when one considers the presence of obstacles in the robot’s workspace, resulting in constraints on the possible values of $q \in \mathcal Q \subseteq [-\pi, +\pi]^n \subset \mathbb R^n$ in the general case of $n$-links robots. 
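To make the formulation above concrete, the following is a minimal sketch (our own illustration, not part of `lerobot`) of the forward kinematics $p(q)$ of the planar 2-link manipulator, together with an iterative solver for $\min_{q \in \mathcal Q} \Vert p(q) - p^* \Vert_2^2$. It assumes unit-length links and uses plain gradient descent with clipping to $[-\pi, +\pi]^2$ as one possible choice of iterative method. The function names, step size, iteration budget and initial configuration are all hypothetical.

```python
import math

LINK_LENGTH = 1.0  # assumed value of l, chosen only for illustration


def forward_kinematics(q, l=LINK_LENGTH):
    """Return the end-effector position p(q) = (p_x, p_y) of the planar arm."""
    theta1, theta2 = q
    px = l * math.cos(theta1) + l * math.cos(theta1 + theta2)
    py = l * math.sin(theta1) + l * math.sin(theta1 + theta2)
    return px, py


def ik_loss(q, p_star, l=LINK_LENGTH):
    """Squared Euclidean distance between p(q) and the goal position p*."""
    px, py = forward_kinematics(q, l)
    return (px - p_star[0]) ** 2 + (py - p_star[1]) ** 2


def ik_loss_gradient(q, p_star, l=LINK_LENGTH):
    """Gradient of ik_loss with respect to (theta1, theta2), via the chain rule."""
    theta1, theta2 = q
    px, py = forward_kinematics(q, l)
    err_x, err_y = px - p_star[0], py - p_star[1]
    # Partial derivatives of p_x and p_y with respect to each joint angle.
    dpx_dt1 = -l * math.sin(theta1) - l * math.sin(theta1 + theta2)
    dpx_dt2 = -l * math.sin(theta1 + theta2)
    dpy_dt1 = l * math.cos(theta1) + l * math.cos(theta1 + theta2)
    dpy_dt2 = l * math.cos(theta1 + theta2)
    grad1 = 2.0 * (err_x * dpx_dt1 + err_y * dpy_dt1)
    grad2 = 2.0 * (err_x * dpx_dt2 + err_y * dpy_dt2)
    return grad1, grad2


def solve_ik(p_star, q_init=(0.3, 0.3), step_size=0.1, n_steps=2000, l=LINK_LENGTH):
    """Approximately minimize ik_loss over Q = [-pi, pi]^2 by projected gradient descent."""
    q = q_init
    for _ in range(n_steps):
        grad = ik_loss_gradient(q, p_star, l)
        # Gradient step followed by projection (clipping) back onto Q.
        q = tuple(
            min(math.pi, max(-math.pi, angle - step_size * g))
            for angle, g in zip(q, grad)
        )
    return q


goal = (1.0, 1.0)  # a reachable goal: its norm is below 2 * LINK_LENGTH
q_solution = solve_ik(goal)
print("configuration:", q_solution)
print("reached position:", forward_kinematics(q_solution))
print("residual loss:", ik_loss(q_solution, goal))
```

The step size here is tuned for unit-length links and would likely need to shrink for longer ones. A restricted feasible set $\mathcal Q$ would replace the simple clipping with a projection onto that set, which is generally harder to write down.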
-For instance, the robot in Figure 7 is (very naturally) obstacled by the presence of the surface upon which it rests: $\theta_1$ can now exclusively vary within $[0, \pi]$, while possible variations in $\theta_2$ depend on $\theta_1$ (when $\theta_1 \to 0$ or $\theta_1 \to \pi$, further downwards movements are restricted). Even for a simplified kinematic model, developing techniques to solve eq. [eq:ik_problem] is in general non-trivial in the presence of constraints, particularly considering that the feasible set of solutions $\mathcal Q$ may change across problems. Figure 9 provides an example of how the environment influences the feasible set considered, with a new set of constraints deriving from the position of a new obstacle. +For instance, the robot in Figure 7 is (very naturally) obstacled by the presence of the surface upon which it rests: $\theta_1$ can now exclusively vary within $[0, \pi]$, while possible variations in $\theta_2$ depend on $\theta_1$ (when $\theta_1 \to 0$ or $\theta_1 \to \pi$, further downwards movements are restricted). Even for a simplified kinematic model, developing techniques to solve eq. [ik_problem] is in general non-trivial in the presence of constraints, particularly considering that the feasible set of solutions $\mathcal Q$ may change across problems. Figure 9 provides an example of how the environment influences the feasible set considered, with a new set of constraints deriving from the position of a new obstacle. -However, IK--solving eq. [eq:ik_problem] for a feasible $q$--only proves useful in determining information regarding the robot’s configuration in the goal pose, and crucially does not provide information on the *trajectory* to follow over time to reach a target pose. Expert-defined trajectories obviate to this problem providing a length-$K$ succession of goal poses $\tau_K = [p^*_0, p^*_1, \dots p^*_K]$ for tracking. 
In practice, trajectories can also be obtained automatically through *motion planning* algorithms, thus avoiding expensive trajectory definition from human experts. However, tracking $\tau_K$ via IK can prove prohibitively expensive, as tracking would require $K$ resolutions of eq. [eq:ik_problem] (one for each target pose). *Differential* inverse kinematics (diff-IK) complements IK via closed-form solution of a variant of eq. [eq:ik_problem]. Let $J(q)$ denote the Jacobian matrix of (partial) derivatives of the FK-function $f_\text{FK}: \mathcal Q \mapsto \mathcal P$, such that $J(q) = \frac{\partial f_{FK}(q)}{\partial q }$. Then, one can apply the chain rule to any $p(q) = f_{\text{FK}}(q)$, deriving $\dot p = J(q) \dot q$, and thus finally relating variations in the robot configurations to variations in pose, thereby providing a platform for control. +However, IK--solving eq. [ik_problem] for a feasible $q$--only proves useful in determining information regarding the robot’s configuration in the goal pose, and crucially does not provide information on the *trajectory* to follow over time to reach a target pose. Expert-defined trajectories obviate to this problem providing a length-$K$ succession of goal poses $\tau_K = [p^*_0, p^*_1, \dots p^*_K]$ for tracking. In practice, trajectories can also be obtained automatically through *motion planning* algorithms, thus avoiding expensive trajectory definition from human experts. However, tracking $\tau_K$ via IK can prove prohibitively expensive, as tracking would require $K$ resolutions of eq. [ik_problem] (one for each target pose). *Differential* inverse kinematics (diff-IK) complements IK via closed-form solution of a variant of eq. [ik_problem]. Let $J(q)$ denote the Jacobian matrix of (partial) derivatives of the FK-function $f_\text{FK}: \mathcal Q \mapsto \mathcal P$, such that $J(q) = \frac{\partial f_{FK}(q)}{\partial q }$. 
Then, one can apply the chain rule to any $p(q) = f_{\text{FK}}(q)$, deriving $\dot p = J(q) \dot q$, and thus finally relating variations in the robot configurations to variations in pose, thereby providing a platform for control. -Given a desired end-effector trajectory $\dot {p}^*(t)$ (1) indicating anchor regions in space and (2) how much time to spend in each region, diff-IK finds $\dot q(t)$ solving for joints’ *velocities* instead of *configurations*, $\dot q(t) = \arg\min_\nu \; \lVert J(q(t)) \nu - \dot {p}^*(t) \rVert_2^2 $ +Given a desired end-effector trajectory $\dot {p}^*(t)$ (1) indicating anchor regions in space and (2) how much time to spend in each region, diff-IK finds $\dot q(t)$ solving for joints’ *velocities* instead of *configurations*, $\htmlId{reg_ik_velocity}{\dot q(t) = \arg\min_\nu \; \lVert J(q(t)) \nu - \dot {p}^*(t) \rVert_2^2}$ -Unlike eq. [eq:ik_problem], solving for $\dot q$ is much less dependent on the environment (typically, variations in velocity are constrained by physical limits on the actuators). Conveniently, eq. [eq:reg_ik_velocity] also often admits the closed-form solution $\dot q = J(q)^+ \dot {p}^*$, where $J^+(q)$ denotes the Moore-Penrose pseudo-inverse of $J(q)$. Finally, discrete-time joint configurations $q$ can be reconstructed from joint velocities $\dot q$ using forward-integration on the continuous-time joint velocity , $q_{t+1} = q_t + \Delta t\,\dot q_t$ for a given $\Delta t$, resulting in tracking via diff-IK. +Unlike eq. [ik_problem], solving for $\dot q$ is much less dependent on the environment (typically, variations in velocity are constrained by physical limits on the actuators). Conveniently, eq. [reg_ik_velocity] also often admits the closed-form solution $\dot q = J(q)^+ \dot {p}^*$, where $J^+(q)$ denotes the Moore-Penrose pseudo-inverse of $J(q)$. 
Finally, discrete-time joint configurations $q$ can be reconstructed from joint velocities $\dot q$ using forward-integration on the continuous-time joint velocity , $q_{t+1} = q_t + \Delta t\,\dot q_t$ for a given $\Delta t$, resulting in tracking via diff-IK. Following trajectories with diff-IK is a valid option in well-controlled and static environments (e.g., industrial manipulators in controlled manufacturing settings), and relies on the ability to define a set of target velocities to track $[\dot {p}^*_0, \dot {p}^*_1, \dots, \dot {p}^*_k ]$--an error-prone task largely requiring human expertise. Furthermore, diff-IK relies on the ability to (1) access $J(q) \, \forall q \in \mathcal Q$ and (2) compute its pseudo-inverse at every iteration of a given control cycle--a challenging assumption in highly dynamical settings, or for complex kinematic chains. @@ -360,7 +360,7 @@ r0.3 -One such case is presented in Figure [fig:planar-manipulator-box-velocity], where another rigid body other than the manipulator is moving in the environment along the horizontal axis, with velocity $\dot x_B$. Accounting analytically for the presence of this disturbance--for instance, to prevent the midpoint of the link from ever colliding with the object--requires access to $\dot x_B$ at least, to derive the equation characterizing the motion of the environment. +One such case is presented in Figure [planar-manipulator-box-velocity], where another rigid body other than the manipulator is moving in the environment along the horizontal axis, with velocity $\dot x_B$. Accounting analytically for the presence of this disturbance--for instance, to prevent the midpoint of the link from ever colliding with the object--requires access to $\dot x_B$ at least, to derive the equation characterizing the motion of the environment. 
Less predictable disturbances however (e.g., $\dot x_B \leftarrow \dot x_B + {\varepsilon}, {\varepsilon}\sim N(0,1)$) may prove challenging to model analytically, and one could attain the same result of preventing link-object collision by adding a condition on the distance between the midpoint of $l$ and $x_B$, enforced through a feedback loop on the position of the robot and object at each control cycle. @@ -378,7 +378,7 @@ Despite the last 60+ years of robotics research, autonomous robots are still lar src={ch2_classical_limitations} zoomable downloadable - id="fig:classical-limitations" + id="classical-limitations" layout="fixed" alt="Dynamics-based approaches to robotics suffer from several limitations: (1) orchestrating multiple co..." caption={'Dynamics-based approaches to robotics suffer from several limitations: (1) orchestrating multiple components poses integration challenges; (2) the need to develop custom processing pipelines for the sensing modalities and tasks considered hinders scalability; (3) simplified analytical models of physical phenomena (here friction at the gripper; credits to @antonovaReinforcementLearningPivoting2017) limit real-world performance. Lastly, (4) dynamics-based methods overlook trends in the availability and growth of robotics data.'} @@ -392,7 +392,7 @@ Setting aside integration and scalability challenges: developing accurate modeli Lastly, dynamics-based methods (naturally) overlook the rather recent increase in availability of openly-available robotics datasets. The curation of academic datasets by large centralized groups of human experts in robotics @collaborationOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 is now increasingly complemented by a growing number of robotics datasets contributed in a decentralized fashion by individuals with varied expertise. 
If not tangentially, dynamics-based approaches are not posed to maximally benefit from this trend, which holds the premise of allowing generalization in the space of tasks and embodiments, like data was the cornerstone for advancements in vision @alayracFlamingoVisualLanguage2022 and natural-language understanding @brownLanguageModelsAre2020. -Taken together, these limitations (Figure 10) motivate the exploration of learning-based approaches that can (1) integrate perception and control more tightly, (2) adapt across tasks and embodiments with reduced expert modeling interventions and (3) scale gracefully in performance as more robotics data becomes available. +Taken together, these limitations (Figure 10) motivate the exploration of learning-based approaches that can (1) integrate perception and control more tightly, (2) adapt across tasks and embodiments with reduced expert modeling interventions and (3) scale gracefully in performance as more robotics data becomes available. ## Robot (Reinforcement) Learning @@ -412,51 +412,51 @@ TL;DR The need for expensive high-fidelity simulators can be obviated by learnin src={ch3_learning_benefits} zoomable downloadable - id="fig:robot-learning-upsides" + id="robot-learning-upsides" layout="fixed" alt="Learning-based robotics streamlines perception-to-action by learning a (1) unified high-level contro..." caption={'Learning-based robotics streamlines perception-to-action by learning a (1) unified high-level controller capable to take (2) high-dimensional, unstructured sensorimotor information. Learning (3) does not require a dynamics model and instead focuses on interaction data, and (4) empirically correlates with the scale of the data used.'} /> -Learning-based techniques for robotics naturally address the limitations presented in 2 (Figure 11). 
Learning-based techniques typically rely on prediction-to-action (*visuomotor policies*), thereby directly mapping sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. Mapping sensorimotor inputs to actions directly also allows to add diverse input modalities, leveraging the automatic feature extraction characteristic of most modern learning systems. Further, learning-based approaches can in principle entirely bypass modeling efforts and instead rely exclusively on interactions data, proving transformative when dynamics are challenging to model or even entirely unknown. Lastly, learning for robotics (*robot learning*) is naturally well posed to leverage the growing amount of robotics data openly available, just as computer vision first and natural language processing later did historically benefit from large scale corpora of (possibly non curated) data, in great part overlooked by dynamics-based approaches. +Learning-based techniques for robotics naturally address the limitations presented in 2 (Figure 11). Learning-based techniques typically rely on prediction-to-action (*visuomotor policies*), thereby directly mapping sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. Mapping sensorimotor inputs to actions directly also allows to add diverse input modalities, leveraging the automatic feature extraction characteristic of most modern learning systems. Further, learning-based approaches can in principle entirely bypass modeling efforts and instead rely exclusively on interactions data, proving transformative when dynamics are challenging to model or even entirely unknown. 
Lastly, learning for robotics (*robot learning*) is naturally well posed to leverage the growing amount of robotics data openly available, just as computer vision first and natural language processing later did historically benefit from large scale corpora of (possibly non curated) data, in great part overlooked by dynamics-based approaches. -Being a field at its relative nascent stages, no prevalent technique(s) proved distinctly better better in robot learning. Still, two major classes of methods gained prominence: reinforcement learning (RL) and Behavioral Cloning (BC) (Figure 12). In this section, we provide a conceptual overview of applications of the former to robotics, as well as introduce practical examples of how to use RL within `lerobot`. We then introduce the major limitations RL suffers from, to introduce BC techniques in the next sections ([sec:learning-bc-single, sec:learning-bc-generalist]). +Being a field at its relative nascent stages, no prevalent technique(s) proved distinctly better better in robot learning. Still, two major classes of methods gained prominence: reinforcement learning (RL) and Behavioral Cloning (BC) (Figure 12). In this section, we provide a conceptual overview of applications of the former to robotics, as well as introduce practical examples of how to use RL within `lerobot`. We then introduce the major limitations RL suffers from, to introduce BC techniques in the next sections ([learning-bc-single-sec-learning-bc-generalist]). -In Figure 12 we decided to include generalist robot models @blackp0VisionLanguageActionFlow2024, @shukorSmolVLAVisionLanguageActionModel2025 alongside task-specific BC methods. 
While significant different in spirit--*generalist* models are language-conditioned and use instructions to generate motion valid across many tasks, while *task-specific* models are typically not language-conditioned and used to perform a single task--foundation models are largely trained to reproduce trajectories contained in a large training set of input demonstrations. Thus, we argue generalist policies can indeed be grouped alongside other task-specific BC methods, as they both leverage similar training data and schemas. +In Figure 12 we decided to include generalist robot models @blackp0VisionLanguageActionFlow2024, @shukorSmolVLAVisionLanguageActionModel2025 alongside task-specific BC methods. While significant different in spirit--*generalist* models are language-conditioned and use instructions to generate motion valid across many tasks, while *task-specific* models are typically not language-conditioned and used to perform a single task--foundation models are largely trained to reproduce trajectories contained in a large training set of input demonstrations. Thus, we argue generalist policies can indeed be grouped alongside other task-specific BC methods, as they both leverage similar training data and schemas. -Figure 12 illustrates this categorization graphically, explicitly listing all the robot learning policies currently available in `lerobot`: Action Chunking with Transformers (ACT) @zhaoLearningFineGrainedBimanual2023, Diffusion Policy @chiDiffusionPolicyVisuomotor2024, Vector-Quantized Behavior Transformer (VQ-BeT) @leeBehaviorGenerationLatent2024, $\pi_0$ @blackp0VisionLanguageActionFlow2024, SmolVLA @shukorSmolVLAVisionLanguageActionModel2025, Human-in-the-loop Sample-efficient RL (HIL-SERL) @luoPreciseDexterousRobotic2024 and TD-MPC @hansenTemporalDifferenceLearning2022. 
+Figure 12 illustrates this categorization graphically, explicitly listing all the robot learning policies currently available in `lerobot`: Action Chunking with Transformers (ACT) @zhaoLearningFineGrainedBimanual2023, Diffusion Policy @chiDiffusionPolicyVisuomotor2024, Vector-Quantized Behavior Transformer (VQ-BeT) @leeBehaviorGenerationLatent2024, $\pi_0$ @blackp0VisionLanguageActionFlow2024, SmolVLA @shukorSmolVLAVisionLanguageActionModel2025, Human-in-the-loop Sample-efficient RL (HIL-SERL) @luoPreciseDexterousRobotic2024 and TD-MPC @hansenTemporalDifferenceLearning2022.

-Applications of RL to robotics have long been studied, to the point that the relationship between these two disciplines has been compared to that between physics and mathematics @koberReinforcementLearningRobotics. Indeed, due to their interactive and sequential nature, many robotics problems can be directly mapped to RL problems. Figure 13 depicts two such cases. Reaching for an object to move it elsewhere in the scene is indeed a sequential problem, where at each cycle the controller needs to adjust the position of the robotic arm based on its current configuration and the (possibly varying) position of the object. Figure 13 also shows an example of a locomotion problem, where sequentiality is inherent in the problem formulation. While sliding to the side, the controller has to keep adjusting based on the robot’s proprioception to avoid failure (falling).
+Applications of RL to robotics have long been studied, to the point that the relationship between these two disciplines has been compared to that between physics and mathematics @koberReinforcementLearningRobotics. Indeed, due to their interactive and sequential nature, many robotics problems can be directly mapped to RL problems. Figure 13 depicts two such cases.
Reaching for an object to move it elsewhere in the scene is indeed a sequential problem, where at each cycle the controller needs to adjust the position of the robotic arm based on its current configuration and the (possibly varying) position of the object. Figure 13 also shows an example of a locomotion problem, where sequentiality is inherent in the problem formulation. While sliding to the side, the controller has to keep adjusting based on the robot’s proprioception to avoid failure (falling).

### A (Concise) Introduction to RL

-The RL framework @suttonReinforcementLearningIntroduction2018, which we briefly introduce here, has often been used to model robotics problems @koberReinforcementLearningRobotics. RL is a subfield within ML fundamentally concerned with the development of autonomous systems (*agents*) learning how to *continuously behave* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*). Crucially for robotics, RL agents can improve via trial-and-error alone, thus entirely bypassing the need to develop explicit models of the problem dynamics, and rather exploiting interaction data only. In RL, this feedback loop (Figure 14) between actions and outcomes is established through the agent sensing a scalar quantity (*reward*).
+The RL framework @suttonReinforcementLearningIntroduction2018, which we briefly introduce here, has often been used to model robotics problems @koberReinforcementLearningRobotics. RL is a subfield within ML fundamentally concerned with the development of autonomous systems (*agents*) learning how to *continuously behave* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*). Crucially for robotics, RL agents can improve via trial-and-error alone, thus entirely bypassing the need to develop explicit models of the problem dynamics, and rather exploiting interaction data only.
In RL, this feedback loop (Figure 14) between actions and outcomes is established through the agent sensing a scalar quantity (*reward*).

-- $\mathcal D$ represents the (possibly non-deterministic) environment dynamics, with $\mathcal D: \mathcal S\times \mathcal A\times \mathcal S\mapsto [0, 1]$ corresponding to $\mathcal D\, (s_t, a_t, s_{t+1})= \mathbb P (s_{t+1}\vert s_t, a_t)$. For instance, for a planar manipulator, dynamics could be considered deterministic when the environment is fully described (Figure 6), and stochastic when unmodeled disturbances depending on non-observable parameters intervene (Figure [fig:planar-manipulator-box-velocity]).
+- $\mathcal D$ represents the (possibly non-deterministic) environment dynamics, with $\mathcal D: \mathcal S\times \mathcal A\times \mathcal S\mapsto [0, 1]$ corresponding to $\mathcal D\, (s_t, a_t, s_{t+1})= \mathbb P (s_{t+1}\vert s_t, a_t)$. For instance, for a planar manipulator, dynamics could be considered deterministic when the environment is fully described (Figure 6), and stochastic when unmodeled disturbances depending on non-observable parameters intervene (Figure [planar-manipulator-box-velocity]).

-- $r: \mathcal S\times \mathcal A\times \mathcal S\to \mathbb R$ is the *reward function*, weighing the transition $(s_t, a_t, s_{t+1})$ in the context of the achievement of an arbitrary goal. For instance, a simple reward function for quickly moving along the $x$ axis in 3D space (Figure 13) could be based on the absolute position of the robot along the $x$ axis ($p_x$), penalize falling over (measured from $p_z$), and introduce bonuses for speed ($\dot p_x$), $r (s_t, a_t, s_{t+1})\equiv r(s_t) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}}$.
+- $r: \mathcal S\times \mathcal A\times \mathcal S\to \mathbb R$ is the *reward function*, weighing the transition $(s_t, a_t, s_{t+1})$ in the context of the achievement of an arbitrary goal. For instance, a simple reward function for quickly moving along the $x$ axis in 3D space (Figure 13) could be based on the absolute position of the robot along the $x$ axis ($p_x$), penalize falling over (measured from $p_z$), and introduce bonuses for speed ($\dot p_x$), $r (s_t, a_t, s_{t+1})\equiv r(s_t) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}}$.
Lastly, $\gamma \in [0,1]$ represents the discount factor regulating preference for immediate versus long-term reward (with an effective horizon equal to $\tfrac{1}{1-\gamma}$), and $\rho$ is the distribution, defined over $\mathcal S$, the MDP’s *initial* state is sampled from, $s_0 \sim \rho$. A length-$T$ *trajectory* is the (random) sequence
``` math
-\begin{equation}
-
-  \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T),
-\end{equation}
+\htmlId{trajectory_definition}{\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T),}
```
with per-step rewards defined as $r_t = r (s_t, a_t, s_{t+1})$ for ease of notation. Interestingly, assuming both the environment dynamics and conditional distribution over actions given states--the *policy*--to be *Markovian*:
$$
-`\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t) `
+`\htmlId{dynamics_markovian}{\mathbb P(s_{t+1}\vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) = \mathbb P (s_{t+1}\vert s_t, a_t)\\ \mathbb P(a_t\vert s_t, a_{t-1}, s_{t-1}, s_0, a_0) = \mathbb P(a_t\vert s_t)}`
$$
the probability of observing a given trajectory $\tau$ factorizes into
``` math
-\begin{equation}
-
-  \mathbb P(\tau) = \mathbb P (s_0) \prod_{t=0}^{T-1} \mathbb P (s_{t+1}\vert s_t, a_t)\ \mathbb P(a_t\vert s_t).
-\end{equation}
+\htmlId{traj_prob}{\mathbb P(\tau) = \mathbb P (s_0) \prod_{t=0}^{T-1} \mathbb P (s_{t+1}\vert s_t, a_t)\ \mathbb P(a_t\vert s_t).}
```
Policies $\mathbb P(a_t\vert s_t)$ are typically indicated as $\pi(a_t\vert s_t)$, and often parametrized via $\theta$, yielding $\pi_\theta (a_t\vert s_t)$. Policies are trained optimizing the (discounted) *return* associated with a given $\tau$, i.e. the (random) sum of measured rewards over the trajectory:
@@ -504,12 +498,12 @@ G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t.
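As a rough illustration of the discounted return $G(\tau)$ and of the effective horizon $\tfrac{1}{1-\gamma}$, the following sketch uses plain Python only. The function names `discounted_return` and `effective_horizon` are hypothetical helpers introduced here for illustration and are not part of `lerobot`.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute G = sum_t gamma^t * r_t for one trajectory's per-step rewards."""
    g = 0.0
    # Accumulate backwards so each step applies a single multiplication by gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma: float) -> float:
    """Effective horizon 1 / (1 - gamma); only defined for gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("effective horizon requires 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]  # toy per-step rewards r_0, r_1, r_2
    print(discounted_return(rewards, gamma=0.5))
    print(effective_horizon(0.99))
```

With $\gamma$ close to 1 the agent weighs distant rewards almost as much as immediate ones, whereas smaller values shorten the effective horizon.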
In that, agents seek to learn control strategies (*policies*, $\pi_\theta$) maximizing the expected return $\mathbb E_{\tau \sim \pi_\theta} G(\tau)$. For a given dynamics $\mathcal D$--i.e., for a given problem--taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies $\Pi$, yielding the (maximization) target $J : \Pi \mapsto \mathbb R$ $$ -`J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\ \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).` +`\htmlId{RL-j-function}{J(\pi_\theta) = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} [G(\tau)],\\ \mathbb P_{\theta; \mathcal D} (\tau) = \rho \prod_{t=0}^{T-1} \mathcal D (s_t, a_t, s_{t+1})\ \pi_\theta (a_t\vert s_t).}` $$ -Because in the RL framework the agent is assumed to only be able to observe the environment dynamics and not to intervene on them, [eq:RL-j-function] varies exclusively with the policy followed. In turn, MDPs naturally provide a framework to optimize over the space of the possible behaviors an agent might enact ($\pi \in \Pi$), searching for the *optimal policy* $\pi^* = \arg \max_{\theta} J(\pi_\theta)$, where $\theta$ is the parametrization adopted by the policy set $\Pi: \pi_\theta \in \Pi, \ \forall \theta$. Other than providing a target for policy search, $G(\tau)$ can also be used as a target to discriminate between states and state-action pairs. Given any state $s \in \mathcal S$--e.g., a given configuration of the robot--the *state-value* function +Because in the RL framework the agent is assumed to only be able to observe the environment dynamics and not to intervene on them, [RL-j-function] varies exclusively with the policy followed. 
In turn, MDPs naturally provide a framework to optimize over the space of the possible behaviors an agent might enact ($\pi \in \Pi$), searching for the *optimal policy* $\pi^* = \arg \max_{\theta} J(\pi_\theta)$, where $\theta$ is the parametrization adopted by the policy set $\Pi: \pi_\theta \in \Pi, \ \forall \theta$. Other than providing a target for policy search, $G(\tau)$ can also be used as a target to discriminate between states and state-action pairs. Given any state $s \in \mathcal S$--e.g., a given configuration of the robot--the *state-value* function
``` math
V_\pi(s) = \mathbb E_{\tau \sim \pi} [G(\tau) \big \vert s_0 = s]
```
@@ -520,16 +514,16 @@ Q_\pi(s,a) = \mathbb E_{\tau \sim \pi} [G (\tau) \big \vert s_0 = s, a_0=a]
Crucially, value functions are interrelated:
$$
-`Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)] `
+`\htmlId{q-as-v}{Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim \mathbb P(\bullet \vert s_t, a_t)} [r_t + \gamma V_\pi(s_{t+1})]\\ V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)]}`
$$
- Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attempts to find (approximate) solutions to the problem of maximizing cumulative reward (Figure 15).
+ Inducing an ordering over states and state-action pairs under $\pi$, value functions are central to most RL algorithms. A variety of methods have been developed in RL as standalone attempts to find (approximate) solutions to the problem of maximizing cumulative reward (Figure 15).

-Training RL policies in simulation @tobinDomainRandomizationTransferring2017 addresses both issues: it eliminates physical risk and dramatically increases throughput.
Yet, simulators require significant modeling effort, and rely on assumptions (simplified physical modeling, instantaneous actuation, static environmental conditions, etc.) that limit the transfer of policies learned in simulation, due to the discrepancy between real and simulated environments (*reality gap*, Figure 16). *Domain randomization* (DR) is a popular technique to overcome the reality gap, consisting of randomizing parameters of the simulated environment during training, to induce robustness to specific disturbances. In turn, DR is employed to increase the diversity of scenarios over the course of training, improving the chances of sim-to-real transfer @akkayaSolvingRubiksCube2019, @antonovaReinforcementLearningPivoting2017, @jiDribbleBotDynamicLegged2023. In practice, DR is performed by further parametrizing the *simulator*’s dynamics $\mathcal D \equiv \mathcal D_\xi$ with a *dynamics* (random) vector $\xi$ drawn from an arbitrary distribution, $\xi \sim \Xi$. Over the course of training--typically at each episode’s reset--a new $\xi$ is drawn, and used to specify the environment’s dynamics for that episode. For instance, one could decide to randomize the friction coefficient of the surface in a locomotion task (Figure 17), or the center of mass of an object for a manipulation task.
+Training RL policies in simulation @tobinDomainRandomizationTransferring2017 addresses both issues: it eliminates physical risk and dramatically increases throughput. Yet, simulators require significant modeling effort, and rely on assumptions (simplified physical modeling, instantaneous actuation, static environmental conditions, etc.) that limit the transfer of policies learned in simulation, due to the discrepancy between real and simulated environments (*reality gap*, Figure 16). *Domain randomization* (DR) is a popular technique to overcome the reality gap, consisting of randomizing parameters of the simulated environment during training, to induce robustness to specific disturbances.
In turn, DR is employed to increase the diversity of scenarios over the course of training, improving the chances of sim-to-real transfer @akkayaSolvingRubiksCube2019, @antonovaReinforcementLearningPivoting2017, @jiDribbleBotDynamicLegged2023. In practice, DR is performed by further parametrizing the *simulator*’s dynamics $\mathcal D \equiv \mathcal D_\xi$ with a *dynamics* (random) vector $\xi$ drawn from an arbitrary distribution, $\xi \sim \Xi$. Over the course of training--typically at each episode’s reset--a new $\xi$ is drawn, and used to specify the environment’s dynamics for that episode. For instance, one could decide to randomize the friction coefficient of the surface in a locomotion task (Figure 17), or the center of mass of an object for a manipulation task.

- Where $\chi$ represents a behavior distribution over state-action pairs. Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing the reuse of prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function [eq:dqn-loss] via Monte-Carlo (MC) estimates.
+ Where $\chi$ represents a behavior distribution over state-action pairs. Crucially, $\chi$ can in principle be different from the policy being followed, effectively allowing the reuse of prior data stored in a *replay buffer* in the form of $(s_t, a_t, r_t, s_{t+1})$ transitions, used to form the TD-target $y_i$, TD-error $\delta_i$ and loss function [dqn-loss] via Monte-Carlo (MC) estimates.

While effective in handling large, unstructured state spaces for discrete action-space problems, DQN’s application to continuous control problems proved challenging. Indeed, in the case of high-capacity function approximators such as neural networks, solving $\max_{a_t \in \mathcal A} Q_\theta(s_t, a_t)$ at each timestep is simply infeasible due to (1) the continuous nature of the action space ($\mathcal A\subset \mathbb R^n$ for some $n$) and (2) the impossibility of finding a cheap (ideally, closed-form) solution to the maximization of $Q_\theta$. @silverDeterministicPolicyGradient2014 tackle this fundamental challenge by using a *deterministic* function of the state $s_t$ as policy, $\mu_\phi(s_t) = a_t$, parametrized by $\phi$.
Thus, policies can be iteratively refined by updating $\phi$ along the direction:
``` math
-\begin{equation}
-
-  d_\phi = \mathbb E_{s_t \sim \mathbb P (\bullet)} [\nabla_\phi Q(s_t, a_t)\vert_{a_t = \mu_\phi(s_t)}] = \mathbb E_{s_t \sim \mathbb P(\bullet)} [\nabla_{a_t} Q(s_t, a_t) \vert_{a_t = \mu_\phi(s_t)} \cdot \nabla_\phi \mu(s_t)]
-\end{equation}
+\htmlId{deterministic-pg}{d_\phi = \mathbb E_{s_t \sim \mathbb P (\bullet)} [\nabla_\phi Q(s_t, a_t)\vert_{a_t = \mu_\phi(s_t)}] = \mathbb E_{s_t \sim \mathbb P(\bullet)} [\nabla_{a_t} Q(s_t, a_t) \vert_{a_t = \mu_\phi(s_t)} \cdot \nabla_\phi \mu(s_t)]}
```
-Provably, [eq:deterministic-pg] is the *deterministic policy gradient* (DPG) of the policy $\mu_\phi$ @silverDeterministicPolicyGradient2014, so that updates $\phi_{k+1}\leftarrow \phi_k + \alpha d_\phi$ are guaranteed to increase the (deterministic) cumulative discounted reward, $J(\mu_\phi)$. @lillicrapContinuousControlDeep2019 extended DPG to the case of (1) high-dimensional unstructured observations and (2) continuous action spaces, introducing Deep Deterministic Policy Gradient (DDPG), an important algorithm in RL and its applications to robotics. DDPG adopts a modified TD-target compared to the one defined in [eq:TD-target], by maintaining a policy network used to select actions, yielding
+Provably, [deterministic-pg] is the *deterministic policy gradient* (DPG) of the policy $\mu_\phi$ @silverDeterministicPolicyGradient2014, so that updates $\phi_{k+1}\leftarrow \phi_k + \alpha d_\phi$ are guaranteed to increase the (deterministic) cumulative discounted reward, $J(\mu_\phi)$. @lillicrapContinuousControlDeep2019 extended DPG to the case of (1) high-dimensional unstructured observations and (2) continuous action spaces, introducing Deep Deterministic Policy Gradient (DDPG), an important algorithm in RL and its applications to robotics.
DDPG adopts a modified TD-target compared to the one defined in [TD-target], by maintaining a policy network used to select actions, yielding
``` math
-\begin{equation}
-
-y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma Q_{\theta_{i-1}} (s_{t+1}, \mu_\phi(s_{t+1})) \big] .
-\end{equation}
+\htmlId{TD-target-ddpg}{y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma Q_{\theta_{i-1}} (s_{t+1}, \mu_\phi(s_{t+1})) \big] .}
```
Similarly to DQN, DDPG employs a replay buffer to reuse past transitions over training for increased sample efficiency, and estimates the loss function via MC estimates.

-Soft Actor-Critic (SAC) @haarnojaSoftActorCriticOffPolicy2018 is a derivation of DDPG in the max-entropy (MaxEnt) RL framework, in which RL agents are tasked with maximizing the discounted cumulative reward, while acting as randomly as possible. MaxEnt RL @haarnojaReinforcementLearningDeep2017 has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularization formulation. In that, MaxEnt revisits the RL objective $J (\pi)$ to specifically account for the policy entropy, $J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} [r_t + \alpha \mathcal H(\pi (\bullet \vert s_t))] $. This modified objective results in the *soft* TD-target:
+Soft Actor-Critic (SAC) @haarnojaSoftActorCriticOffPolicy2018 is a derivation of DDPG in the max-entropy (MaxEnt) RL framework, in which RL agents are tasked with maximizing the discounted cumulative reward, while acting as randomly as possible. MaxEnt RL @haarnojaReinforcementLearningDeep2017 has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularization formulation.
In that, MaxEnt revisits the RL objective $J (\pi)$ to specifically account for the policy entropy, $\htmlId{J-soft}{J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} [r_t + \alpha \mathcal H(\pi (\bullet \vert s_t))]}$. This modified objective results in the *soft* TD-target:
``` math
-\begin{equation}
-
-  y_i = \mathbb E_{s_{t+1} \sim \mathbb P( \bullet \vert s_t, a_t)} [r_t + \gamma \left( Q_{\theta_{i-1}} (s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \vert s_{t+1}) \right)], \quad a_{t+1} \sim \pi_\phi(\bullet \vert s_{t+1})
-\end{equation}
+\htmlId{soft-td-target}{y_i = \mathbb E_{s_{t+1} \sim \mathbb P( \bullet \vert s_t, a_t)} [r_t + \gamma \left( Q_{\theta_{i-1}} (s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \vert s_{t+1}) \right)], \quad a_{t+1} \sim \pi_\phi(\bullet \vert s_{t+1})}
```
-Similarly to DDPG, SAC also maintains an explicit policy, trained under the same MaxEnt framework for the maximization of [eq:J-soft], and updated using:
+Similarly to DDPG, SAC also maintains an explicit policy, trained under the same MaxEnt framework for the maximization of [J-soft], and updated using:
``` math
-\begin{equation}
-
-  \pi_{k+1} \leftarrow \arg\min_{\pi^\prime \in \Pi} \text{D}_{\text{KL}}\left(\pi^\prime (\bullet \vert s_t) \bigg\Vert \frac{\exp(Q_{\pi_k}(s_t, \bullet))}{Z_{\pi_k}(s_t)} \right)
-\end{equation}
+\htmlId{sac-policy-update}{\pi_{k+1} \leftarrow \arg\min_{\pi^\prime \in \Pi} \text{D}_{\text{KL}}\left(\pi^\prime (\bullet \vert s_t) \bigg\Vert \frac{\exp(Q_{\pi_k}(s_t, \bullet))}{Z_{\pi_k}(s_t)} \right)}
```
-The update rule provided in [eq:sac-policy-update] optimizes the policy while projecting it on a set $\Pi$ of tractable distributions (e.g., Gaussians, @haarnojaReinforcementLearningDeep2017).
+The update rule provided in [sac-policy-update] optimizes the policy while projecting it on a set $\Pi$ of tractable distributions (e.g., Gaussians, @haarnojaReinforcementLearningDeep2017).
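To make the difference between the DDPG and soft TD-targets concrete, the sketch below computes a single-sample estimate of each for one transition. It is a minimal illustration with plain Python floats, not the `lerobot` implementation: the function names, the scalar inputs (`q_next`, `log_prob_next`) and the `done` flag for terminal transitions are all assumptions introduced here.

```python
def ddpg_td_target(reward: float, gamma: float, q_next: float, done: bool = False) -> float:
    """Single-sample DDPG target: r + gamma * Q(s', mu(s')).

    `q_next` stands for the target critic evaluated at the next state and the
    action chosen by the deterministic policy (hypothetical scalar input).
    """
    return reward if done else reward + gamma * q_next


def soft_td_target(
    reward: float,
    gamma: float,
    q_next: float,
    log_prob_next: float,
    alpha: float,
    done: bool = False,
) -> float:
    """Single-sample soft target: r + gamma * (Q(s', a') - alpha * log pi(a'|s')).

    `a'` is assumed to be sampled from the current stochastic policy at s';
    `alpha` is the entropy temperature.
    """
    if done:
        return reward
    return reward + gamma * (q_next - alpha * log_prob_next)


if __name__ == "__main__":
    # A low-probability next action (log-prob = -1.0) raises the soft target
    # relative to the DDPG target, rewarding more random behavior.
    print(ddpg_td_target(reward=1.0, gamma=0.9, q_next=2.0))
    print(soft_td_target(reward=1.0, gamma=0.9, q_next=2.0, log_prob_next=-1.0, alpha=0.5))
```

Setting `alpha` to zero recovers the DDPG-style target, which is one way to read SAC as an entropy-regularized variant of the same bootstrapping scheme.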
##### Sample-efficient, data-driven RL

@@ -654,13 +636,13 @@ Lastly, in order to improve on the robustness of their approach to different goa
 src={ch3_hil_serl_examples}
 zoomable
 downloadable
- id="fig:hil-serl-blocks"
+ id="hil-serl-blocks"
 layout="fixed"
 alt="(A) HIL-SERL allows for real-world training of high performance RL agents by building on top advance..."
 caption={'(A) HIL-SERL allows for real-world training of high performance RL agents by building on top of advancements presented by SAC, RLPD and SERL. (B) Example of human intervention during a HIL-SERL training process on a SO-100.'}
 />

-Building on off-policy deep Q-learning with replay buffers, entropy regularization for better exploration and performance, expert demonstrations to guide learning, and a series of tools and recommendations for real-world training using reward classifiers (Figure 18), @luoPreciseDexterousRobotic2024 introduce human interactions during training, learning near-optimal policies in challenging real-world manipulation tasks in 1-2 hours.
+Building on off-policy deep Q-learning with replay buffers, entropy regularization for better exploration and performance, expert demonstrations to guide learning, and a series of tools and recommendations for real-world training using reward classifiers (Figure 18), @luoPreciseDexterousRobotic2024 introduce human interactions during training, learning near-optimal policies in challenging real-world manipulation tasks in 1-2 hours.

Human in the Loop Sample Efficient Robot Reinforcement Learning (HIL-SERL) @luoPreciseDexterousRobotic2024 augments offline-to-online RL with targeted human corrections during training, and employs prior data to (1) train a reward classifier and (2) bootstrap RL training on expert trajectories.
While demonstrations provide the initial dataset seeding learning and constraining early exploration, interactive corrections allow a human supervisor to intervene on failure modes and supply targeted interventions to aid the learning process. Crucially, human interventions are stored in both the offline and online replay buffers, differently from the autonomous transitions generated at training time and stored in the online buffer only. Consequently, given an intervention timestep $k \in (0, T)$, length-$K$ human intervention data $\{ s^{\text{human}}_k, a^{\text{human}}_k, r^{\text{human}}_k, s^{\text{human}}_{k+1},\}_{k=1}^K$ is more likely to be sampled for off-policy learning than the data generated online during training, providing stronger supervision to the agent while still allowing for autonomous learning. Empirically, HIL-SERL attains near-perfect success rates on diverse manipulation tasks within 1-2 hours of training @luoPreciseDexterousRobotic2024, underscoring how combining offline datasets with online RL can markedly improve stability and data efficiency, and ultimately even allow real-world RL training.
@@ -696,21 +678,21 @@ TL;DR Behavioral Cloning provides a natural platform to learn from real-world in
 src={ch4_bc_trajectories}
 zoomable
 downloadable
- id="fig:ch4-bc-trajectories"
+ id="ch4-bc-trajectories"
 layout="fixed"
 alt="(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded ep..."
 caption={'(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in lerobot/svla_so101_pickplace. Proprioceptive states prove invaluable in determining the robot’s state during an episode. (B) Camera frames are also recorded alongside measurements on the robot’s state, capturing information about the robot’s interaction with its environment.'}
 />

-Learning from human demonstrations provides a pragmatic alternative to the reinforcement-learning pipeline discussed in Section 3.
Indeed, in real-world robotics online exploration is typically costly and potentially unsafe, and designing (dense) reward signals is a brittle and task-specific process. In general, success detection itself may often require bespoke instrumentation, while episodic training demands reliable resets--all factors complicating training RL algorithms on hardware at scale. Behavioral Cloning (BC) sidesteps these constraints by casting control as an imitation learning problem, leveraging previously collected expert demonstrations. Most notably, by learning to imitate, autonomous systems naturally adhere to the objectives, preferences, and success criteria implicitly encoded in the data, which reduces early-stage exploratory failures and obviates hand-crafted reward shaping altogether.
+Learning from human demonstrations provides a pragmatic alternative to the reinforcement-learning pipeline discussed in Section 3. Indeed, in real-world robotics online exploration is typically costly and potentially unsafe, and designing (dense) reward signals is a brittle and task-specific process. In general, success detection itself may often require bespoke instrumentation, while episodic training demands reliable resets--all factors complicating training RL algorithms on hardware at scale. Behavioral Cloning (BC) sidesteps these constraints by casting control as an imitation learning problem, leveraging previously collected expert demonstrations. Most notably, by learning to imitate, autonomous systems naturally adhere to the objectives, preferences, and success criteria implicitly encoded in the data, which reduces early-stage exploratory failures and obviates hand-crafted reward shaping altogether.
-Formally, let $\mathcal D = \{ \tau^{(i)} \}_{i=1}^N$ be a set of expert trajectories, with $\tau^{(i)} = \{(o_t^{(i)}, a_t^{(i)})\}_{t=0}^{T_i}$ representing the $i$-th trajectory in $\mathcal D$, $o_t \in \mathcal O$ denoting observations (e.g., images and proprioception altogether), and $a_t \in \mathcal A$ the expert actions. Typically, observations $o \in \mathcal O$ consist of both image and proprioceptive information, while actions $a \in \mathcal A$ represent control specifications for the robot to execute, e.g. a joint configuration. Note that differently from Section 3, in the imitation learning context $\mathcal D$ denotes an offline dataset collecting $N$ length-$T_i$ reward-free (expert) human trajectories $\tau^{(i)}$, and *not* the environment dynamics. Similarly, in this section $\tau^{(i)}$ represents a length-$T_i$ trajectory of observation-action pairs, which crucially *omits entirely any reward* information. Figure 19 graphically shows trajectories in terms of the average evolution of the actuation on the 6 joints over a group of teleoperated episodes for the SO-100 manipulator. Notice how proprioceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified high-frame-rate collection of teleoperation data. Figure 20 shows $(o_t, a_t)$-pairs for the same dataset, with the actions performed by the human expert illustrated just alongside the corresponding observation. In principle, (expert) trajectories $\tau^{(i)}$ can have different lengths since demonstrations might exhibit multi-modal strategies to attain the same goal, resulting in possibly multiple, different behaviors.
+Formally, let $\mathcal D = \{ \tau^{(i)} \}_{i=1}^N$ be a set of expert trajectories, with $\tau^{(i)} = \{(o_t^{(i)}, a_t^{(i)})\}_{t=0}^{T_i}$ representing the $i$-th trajectory in $\mathcal D$, $o_t \in \mathcal O$ denoting observations (e.g., images and proprioception altogether), and $a_t \in \mathcal A$ the expert actions. Typically, observations $o \in \mathcal O$ consist of both image and proprioceptive information, while actions $a \in \mathcal A$ represent control specifications for the robot to execute, e.g. a joint configuration. Note that differently from Section 3, in the imitation learning context $\mathcal D$ denotes an offline dataset collecting $N$ length-$T_i$ reward-free (expert) human trajectories $\tau^{(i)}$, and *not* the environment dynamics. Similarly, in this section $\tau^{(i)}$ represents a length-$T_i$ trajectory of observation-action pairs, which crucially *omits entirely any reward* information. Figure 19 graphically shows trajectories in terms of the average evolution of the actuation on the 6 joints over a group of teleoperated episodes for the SO-100 manipulator. Notice how proprioceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified high-frame-rate collection of teleoperation data. Figure 20 shows $(o_t, a_t)$-pairs for the same dataset, with the actions performed by the human expert illustrated just alongside the corresponding observation. In principle, (expert) trajectories $\tau^{(i)}$ can have different lengths since demonstrations might exhibit multi-modal strategies to attain the same goal, resulting in possibly multiple, different behaviors.
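The structure of $\mathcal D$ can be sketched in a few lines of plain Python: a list of variable-length trajectories, each a list of reward-free $(o_t, a_t)$ pairs, flattened into supervised samples. This is a toy illustration rather than the `lerobot` dataset format. The type aliases, the helper names `flatten` and `bc_mse`, and the choice of a squared-error loss are assumptions made here for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]  # toy stand-in for images + proprioception
Action = Sequence[float]       # toy stand-in for a joint configuration
Trajectory = List[Tuple[Observation, Action]]  # reward-free (o_t, a_t) pairs


def flatten(dataset: List[Trajectory]) -> List[Tuple[Observation, Action]]:
    """Pool the (o_t, a_t) pairs of all trajectories, whatever their lengths."""
    return [pair for trajectory in dataset for pair in trajectory]


def bc_mse(policy: Callable[[Observation], Action], dataset: List[Trajectory]) -> float:
    """Mean squared error between predicted and expert actions (assumed loss)."""
    pairs = flatten(dataset)
    total = 0.0
    for obs, expert_action in pairs:
        predicted = policy(obs)
        total += sum((p - a) ** 2 for p, a in zip(predicted, expert_action))
    return total / len(pairs)


if __name__ == "__main__":
    # Two toy trajectories of different lengths; the "expert" doubles the observation.
    dataset: List[Trajectory] = [
        [([0.0], [0.0]), ([1.0], [2.0])],
        [([2.0], [4.0])],
    ]
    print(bc_mse(lambda o: [2.0 * o[0]], dataset))  # policy matching the expert
    print(bc_mse(lambda o: [o[0]], dataset))        # policy that copies the observation
```

Note that no reward appears anywhere in the data structure: the only supervision is the expert action paired with each observation.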
-While conceptually elegant, point-estimate policies $f : \mathcal O\mapsto \mathcal A$ learned by solving [eq:loss-minimization-SL] have been observed to suffer from (1) compounding errors @rossReductionImitationLearning2011 and (2) poor fit to multimodal distributions @florenceImplicitBehavioralCloning2022, @keGraspingChopsticksCombating2020. Figure 21 illustrates these two key issues related to learning *explicit policies* @florenceImplicitBehavioralCloning2022. Besides sequentiality in $\mathcal D$, compounding errors due to *covariate shift* may also prove catastrophic, as even small $\epsilon$-prediction errors $0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon$ can quickly drive the policy into out-of-distribution states, incurring less confident generations and thus compounding errors (Figure 21, left). Moreover, point-estimate policies typically fail to learn *multimodal* targets, which are very common in human demonstrations solving robotics problems, since multiple trajectories can be equally good towards the accomplishment of a goal (e.g., symmetric grasps, Figure 21, right). In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands @florenceImplicitBehavioralCloning2022. To address poor multimodal fitting, @florenceImplicitBehavioralCloning2022 propose learning the generative model $p(o, a)$ underlying the samples in $\mathcal D$, rather than explicitly learning a prediction function $f(o) = a$.
+While conceptually elegant, point-estimate policies $f : \mathcal O\mapsto \mathcal A$ learned by solving [loss-minimization-SL] have been observed to suffer from (1) compounding errors @rossReductionImitationLearning2011 and (2) poor fit to multimodal distributions @florenceImplicitBehavioralCloning2022, @keGraspingChopsticksCombating2020. Figure 21 illustrates these two key issues related to learning *explicit policies* @florenceImplicitBehavioralCloning2022.
Besides sequentiality in $\mathcal D$, compounding errors due to *covariate shift* may also prove catastrophic, as even small $\epsilon$-prediction errors $0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon$ can quickly drive the policy into out-of-distribution states, incurring less confident generations and thus compounding errors (Figure 21, left). Moreover, point-estimate policies typically fail to learn *multimodal* targets, which are very common in human demonstrations solving robotics problems, since multiple trajectories can be equally good towards the accomplishment of a goal (e.g., symmetric grasps, Figure 21, right). In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands @florenceImplicitBehavioralCloning2022. To address poor multimodal fitting, @florenceImplicitBehavioralCloning2022 propose learning the generative model $p(o, a)$ underlying the samples in $\mathcal D$, rather than explicitly learning a prediction function $f(o) = a$.

### A (Concise) Introduction to Generative Models

@@ -751,7 +730,7 @@ Generative Models (GMs) aim to learn the stochastic process underlying the very
 src={ch4_task_effect_on_pairs}
 zoomable
 downloadable
- id="fig:ch4-task-effect-on-pairs"
+ id="ch4-task-effect-on-pairs"
 layout="fixed"
 alt="Intuitively, latent variable in a single latent model may contain information regarding the task bei..."
 caption={'Intuitively, the latent variable in a single latent model may contain information regarding the task being performed, which directly results in the likelihood of the same observation-action pair being different for two different tasks.
When (A) picking a block the likelihood of a wide gripper’s opening should be higher than narrower one, while it should be the opposite when (B) pushing the block.'} @@ -759,18 +738,15 @@ Generative Models (GMs) aim to learn the stochastic process underlying the very A common inductive bias used in GM posits samples $(o,a)$ are influenced from an unobservable latent variable $z \in Z$, resulting in ``` math -\begin - - p (o,a) = \int_{\text{supp}({Z})} p(o,a \vert z) p(z) -\end{equation} +\htmlId{BC-latent-variable}{p (o,a) = \int_{\text{supp}({Z})} p(o,a \vert z) p(z)} ``` -Intuitively, in the case of observation-action pairs $(o, a)$ for a robotics application, $z$ could be some high level representation of the underlying task being performed by the human demonstrator. In such case, treating $p(o,a)$ as a marginalization over $\text{supp}({Z})$ of the complete joint distribution $p(o,a,z)$ natively captures the effect different tasks have on the likelihood of observation-action pairs. Figure 22 graphically illustrates this concept in the case of a (A) picking and (B) pushing task, for which, nearing the target object, the likelihood of actions resulting in opening the gripper--the higher $q_6$, the wider the gripper’s opening--should intuitively be (A) high or (B) low, depending on the task performed. While the latent space $Z$ typically has a much richer structure than the set of all actual tasks performed, [eq:BC-latent-variable] still provides a solid framework to learn joint distribution conditioned on unobservable yet relevant factors. Figure 23 represents this framework of latent-variable for a robotics application: the true, $z$-conditioned generative process on assigns *likelihood* $p((o,a) \vert z)$ to the single $(o,a)$-pair. Using Bayes’ theorem, one can reconstruct the *posterior* distribution on $\text{supp}({Z})$, $q_\theta(z \vert o,a)$ from the likelihood $p_\theta(o,a \vert z)$, *prior* $p_\theta(z)$ and *evidence* $p_\theta(o,a)$. 
VAEs approximate the latent variable model presented in [eq:BC-latent-variable]) using an *approximate posterior* $q_\phi(z \vert o,a)$ while regressing parameters for a parametric likelihood, $p_\theta(o,a \vert z)$ (Figure 23). +Intuitively, in the case of observation-action pairs $(o, a)$ for a robotics application, $z$ could be some high level representation of the underlying task being performed by the human demonstrator. In such case, treating $p(o,a)$ as a marginalization over $\text{supp}({Z})$ of the complete joint distribution $p(o,a,z)$ natively captures the effect different tasks have on the likelihood of observation-action pairs. Figure 22 graphically illustrates this concept in the case of a (A) picking and (B) pushing task, for which, nearing the target object, the likelihood of actions resulting in opening the gripper--the higher $q_6$, the wider the gripper’s opening--should intuitively be (A) high or (B) low, depending on the task performed. While the latent space $Z$ typically has a much richer structure than the set of all actual tasks performed, [BC-latent-variable] still provides a solid framework to learn joint distribution conditioned on unobservable yet relevant factors. Figure 23 represents this framework of latent-variable for a robotics application: the true, $z$-conditioned generative process on assigns *likelihood* $p((o,a) \vert z)$ to the single $(o,a)$-pair. Using Bayes’ theorem, one can reconstruct the *posterior* distribution on $\text{supp}({Z})$, $q_\theta(z \vert o,a)$ from the likelihood $p_\theta(o,a \vert z)$, *prior* $p_\theta(z)$ and *evidence* $p_\theta(o,a)$. VAEs approximate the latent variable model presented in [BC-latent-variable]) using an *approximate posterior* $q_\phi(z \vert o,a)$ while regressing parameters for a parametric likelihood, $p_\theta(o,a \vert z)$ (Figure 23). 
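
The marginalization in [BC-latent-variable] can be checked numerically on a toy problem. The sketch below is not taken from the tutorial or from `lerobot`: it assumes a hypothetical one-dimensional model in which a scalar $x$ stands in for the pair $(o,a)$, the prior is $p(z) = \mathcal N(0, 1)$, and the likelihood is $p(x \vert z) = \mathcal N(z, \sigma^2)$. Under these assumptions the marginal is $\mathcal N(0, 1 + \sigma^2)$ in closed form, so a Monte Carlo average of $p(x \vert z)$ over prior samples can be compared against it.

```python
import math
import random


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mc_marginal(x: float, sigma2: float, n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)] with p(z) = N(0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)              # sample the latent from the prior
        total += gaussian_pdf(x, z, sigma2)  # likelihood of x given this latent
    return total / n_samples


rng = random.Random(0)
x, sigma2 = 0.5, 0.25  # toy values, chosen only for illustration
estimate = mc_marginal(x, sigma2, n_samples=50_000, rng=rng)
exact = gaussian_pdf(x, 0.0, 1.0 + sigma2)  # closed-form marginal of the toy model
print(f"MC estimate: {estimate:.4f}, closed form: {exact:.4f}")
```

In this toy case the integral is available in closed form; the point of the comparison is only that sampling the latent and averaging the likelihood recovers the marginal, which is the operation that becomes intractable for neural-network likelihoods on high-dimensional data.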
- where we used [eq:BC-latent-variable] in [eq:evidence-definition-1], multiplied by $1 = \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)}$ in [eq:evidence-definition-2], and used the definition of expected value in [eq:evidence-definition]. + where we used [BC-latent-variable] in [evidence-definition-1], multiplied by $1 = \frac{q_\theta(z \vert (o,a)_i)}{q_\theta(z \vert (o,a)_i)}$ in [evidence-definition-2], and used the definition of expected value in [evidence-definition]. -In the special case where one assumes distributions to be tractable, $p_\theta (\mathcal D)$ is typically tractable too, and $\max_\theta \log p_\theta(\mathcal D)$ provides a natural target for (point-wise) infering the unknown parameters $\theta$ of the generative model. Unfortunately, [eq:evidence-definition] is rarely tractable when the distribution $p$ is modeled with approximators such as neural networks, especially for high-dimensional, unstructured data. +In the special case where one assumes distributions to be tractable, $p_\theta (\mathcal D)$ is typically tractable too, and $\max_\theta \log p_\theta(\mathcal D)$ provides a natural target for (point-wise) inferring the unknown parameters $\theta$ of the generative model. Unfortunately, [evidence-definition] is rarely tractable when the distribution $p$ is modeled with approximators such as neural networks, especially for high-dimensional, unstructured data. -In their seminal work on Variational Auto-Encoders (VAEs), @kingmaAutoEncodingVariationalBayes2022 present two major contributions to learn complex latent-variable GMs on unstructured data, proposing (1) a tractable, variational lower-bound to [eq:evidence-definition] as an optimization target to jointly learn likelihood and posterior and (2) high-capacity function approximators to model the likelihood $p_\theta(o,a\vert z)$ and (approximate) posterior distribution $q_\phi(z \vert o,a) \approx q_\theta(z \vert o,a)$. 
+In their seminal work on Variational Auto-Encoders (VAEs), @kingmaAutoEncodingVariationalBayes2022 present two major contributions to learn complex latent-variable GMs on unstructured data, proposing (1) a tractable, variational lower-bound to [evidence-definition] as an optimization target to jointly learn likelihood and posterior and (2) high-capacity function approximators to model the likelihood $p_\theta(o,a\vert z)$ and (approximate) posterior distribution $q_\phi(z \vert o,a) \approx q_\theta(z \vert o,a)$. -In particular, the lower bound on [eq:evidence-definition] (Evidence LOwer Bound, *ELBO*) can be derived from [eq:evidence-definition] applying Jensen’s inequality--$\log \mathbb{E}[\bullet] \geq \mathbb{E} [\log (\bullet)]$--yielding: +In particular, the lower bound on [evidence-definition] (Evidence LOwer Bound, *ELBO*) can be derived from [evidence-definition] applying Jensen’s inequality--$\log \mathbb{E}[\bullet] \geq \mathbb{E} [\log (\bullet)]$--yielding: $$ -`\log p_\theta(\mathcal D) \geq \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] + \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} [\log \left( \frac{p(z)}{q_\theta(z \vert (o,a)_i)} \right)] \right)\\ = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - \text{D}_{\text{KL}}\big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big] \right) ` +`\htmlId{ELBO-intractable}{\log p_\theta(\mathcal D) \geq \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] + \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} [\log \left( \frac{p(z)}{q_\theta(z \vert (o,a)_i)} \right)] \right)\\ = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim p_\theta(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - \text{D}_{\text{KL}}\big[ q_\theta(z \vert (o,a)_i) \Vert p(z) \big] \right)}` $$ - The true, generally intractable 
posterior $p_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in [eq:ELBO-intractable], and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective, $\text{ELBO}_{\mathcal D}(\theta, \phi) = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim q_\phi(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - \text{D}_{\text{KL}}\big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big] \right) $ From Jensen’s inequality, maximizing ELBO results in maximizing the log-likelihood of the data too, thus providing a natural, tractable optimization target. Indeed, expectations can be estimated using MC estimates from the learned distributions in [eq:ELBO], while the KL-divergence term can typically be computed in closed-form (1) modeling $q_\phi$ as a Gaussian $q_\phi(z \vert o,a) = \mathcal N\big(\mu_\phi(o,a), \Sigma_\phi(o,a) \big)$ and (2) imposing a standard Gaussian prior on the latent space, $p(z) = \mathcal N(\mathbf{0}, \mathbf{I})$. + The true, generally intractable posterior $p_\theta (z \vert o,a)$ prevents computing both the expectation and KL divergence terms in [ELBO-intractable], and therefore @kingmaAutoEncodingVariationalBayes2022 propose deriving the ELBO using an *approximate* posterior $q_\phi(z \vert o,a)$, resulting in the final, tractable ELBO objective, $\htmlId{ELBO}{\text{ELBO}_{\mathcal D}(\theta, \phi) = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim q_\phi(\cdot \vert (o,a)_i)} \big[ \log p_\theta((o,a)_i \vert z) \big] - \text{D}_{\text{KL}}\big[ q_\phi(z \vert (o,a)_i) \Vert p(z) \big] \right)}$ From Jensen’s inequality, maximizing ELBO results in maximizing the log-likelihood of the data too, thus providing a natural, tractable optimization target. 
Indeed, expectations can be estimated using MC estimates from the learned distributions in [ELBO], while the KL-divergence term can typically be computed in closed form by (1) modeling $q_\phi$ as a Gaussian $q_\phi(z \vert o,a) = \mathcal N\big(\mu_\phi(o,a), \Sigma_\phi(o,a) \big)$ and (2) imposing a standard Gaussian prior on the latent space, $p(z) = \mathcal N(\mathbf{0}, \mathbf{I})$. An intuitive explanation of the learning dynamics of VAEs can be given considering the equivalent case of *minimizing the negative ELBO*, which admits a particularly interpretable factorization $$ -`\min_{\theta, \phi} - \text{ELBO}_{\mathcal (o,a) \sim \mathcal D}(\theta, \phi) = \min_{\theta, \phi}\mathbf{L^{\text{rec}}}(\theta) + \mathbf{L^{\text{reg}}}(\phi)\\ \mathbf{L^{\text{rec}}}(\theta) = \mathbb{E}_{z \sim q_\phi(\cdot \vert o,a} \big[ \log p_\theta(o,a \vert z) \big]\\ \mathbf{L^{\text{reg}}}(\phi) = \text{D}_{\text{KL}}\big[ q_\phi(z \vert o,a) \Vert p(z) \big] ` +`\htmlId{VAE-min-neg-ELBO}{\min_{\theta, \phi} - \text{ELBO}_{(o,a) \sim \mathcal D}(\theta, \phi) = \min_{\theta, \phi}\mathbf{L^{\text{rec}}}(\theta) + \mathbf{L^{\text{reg}}}(\phi)\\ \mathbf{L^{\text{rec}}}(\theta) = -\mathbb{E}_{z \sim q_\phi(\cdot \vert o,a)} \big[ \log p_\theta(o,a \vert z) \big]\\ \mathbf{L^{\text{reg}}}(\phi) = \text{D}_{\text{KL}}\big[ q_\phi(z \vert o,a) \Vert p(z) \big]}` $$ -For any given $(o,a)$ pair, the expected value term of [eq:VAE-Lrec] is typically computed via MC estimates, resulting in +For any given $(o,a)$ pair, the expected value term of [VAE-Lrec] is typically computed via MC estimates, resulting in ``` math -\mathbb{E}_{z \sim q_\phi(\bullet \vert o,a)} \big[ \log p_\theta(o,a \vert z) \big] = \mathbf{L^{\text{rec}}} \approx - \frac{1}{n} \sum_{i=0}^n \log p_\theta(o,a \vert z_i). 
``` -Assuming $p_\theta(o,a \vert z)$ is parametrized as an isotropic Gaussian distribution with mean $\mu_\theta (z) \in \mathbb R^d$ and variance $\sigma^2$, the log-likelihood thus simplifies to: +Assuming $p_\theta(o,a \vert z)$ is parametrized as an isotropic Gaussian distribution with mean $\mu_\theta (z) \in \mathbb R^d$ and variance $\sigma^2$, the log-likelihood thus simplifies to: ``` math \log p(o,a \vert z_i) = -\frac{1}{2\sigma^{2}} \big \Vert (o,a)-\mu_\theta(z_i) \big\Vert_2^2 -\frac{d}{2}\log(2\pi \sigma^{2}) \implies \mathbf{L^\text{rec}} \approx \frac {1}{n} \sum_{i=0}^n \big\Vert (o,a) - \mu_\theta(z_i) \big \Vert^2_2 ``` -Indeed, it is very common in practice to approximate from the learned likelihood $p_\theta(o,a \vert z)$ as a parametric distribution (e.g. Gaussians) parametrized by some learned vector of coefficients derived from $\mu_\theta (z), \ z \sim p (\bullet)$. In all such cases, learning a VAE corresponds to optimally *reconstructing* the examples in $\mathcal D$ by minimizing the L2-error--a very common *supervised learning* objective for regression targets--while regularizing the information compression into the latent, as under the common modeling choice $p(z) = \mathcal N (\mathbf{0}, \mathbf{I})$ [eq:VAE-Lreg] regularizes the posterior limiting the expressivity of $q_\phi(z\vert o,a)$. +Indeed, it is very common in practice to model the learned likelihood $p_\theta(o,a \vert z)$ as a parametric distribution (e.g., a Gaussian) parametrized by some learned vector of coefficients derived from $\mu_\theta (z), \ z \sim p (\bullet)$. 
In all such cases, learning a VAE corresponds to optimally *reconstructing* the examples in $\mathcal D$ by minimizing the L2-error--a very common *supervised learning* objective for regression targets--while regularizing the information compression into the latent, as under the common modeling choice $p(z) = \mathcal N (\mathbf{0}, \mathbf{I})$ [VAE-Lreg] regularizes the posterior limiting the expressivity of $q_\phi(z\vert o,a)$. #### Diffusion Models -VAEs approximate probability distributions via a *single* latent variable model, assuming the underlying unknown distribution can be factored according to [eq:BC-latent-variable], and solve the variational inference problem of jointly learning the likelihood $p_\theta$ and (approximate) posterior $q_\phi$ for such model. In that, the unknown data distribution $p(o,a)$ is effectively approximated via $\int_Z p(z) p_\theta(o,a \vert z)$, and the underlying generative process reproduced by (1) sampling a latent variable and (2) learning to decode it into a (ideally) high-likelihood sample under the (unknown) $p(o,a)$. Diffusion Models (DMs) @hoDenoisingDiffusionProbabilistic2020 are another class of GMs which treat the similar problem of approximating an underlying unknown data distribution--*variational inference*--by *partially* extending VAEs to the case where *multiple* latent variables influence each other and the generative process underlying $o,a$ itself. In particular, DMs posit the generative process can be decomposed to a series of piece-wise (Markovian) interactions between (latent) variables (Figure 24), resulting in +VAEs approximate probability distributions via a *single* latent variable model, assuming the underlying unknown distribution can be factored according to [BC-latent-variable], and solve the variational inference problem of jointly learning the likelihood $p_\theta$ and (approximate) posterior $q_\phi$ for such model. 
In that, the unknown data distribution $p(o,a)$ is effectively approximated via $\int_Z p(z) p_\theta(o,a \vert z)$, and the underlying generative process reproduced by (1) sampling a latent variable and (2) learning to decode it into a (ideally) high-likelihood sample under the (unknown) $p(o,a)$. Diffusion Models (DMs) @hoDenoisingDiffusionProbabilistic2020 are another class of GMs which treat the similar problem of approximating an underlying unknown data distribution--*variational inference*--by *partially* extending VAEs to the case where *multiple* latent variables influence each other and the generative process underlying $o,a$ itself. In particular, DMs posit the generative process can be decomposed into a series of piece-wise (Markovian) interactions between (latent) variables (Figure 24), resulting in $$ -`p(\underbrace{o,a}_{= z_0}) = \int_{\text{supp}({Z_0})} \int_{\text{supp}({Z_1})} \ldots \int_{\text{supp}({Z_T})} p(z_0, z_1, \dots z_T)\\ p(z_0, z_1, \dots z_T) = p(z_T) \prod_{t=0}^{T} p(z_{t-1} \vert z_t), ` +`\htmlId{BC-multi-latent-model-1}{p(\underbrace{o,a}_{= z_0}) = \int_{\text{supp}({Z_1})} \int_{\text{supp}({Z_2})} \ldots \int_{\text{supp}({Z_T})} p(z_0, z_1, \dots z_T)\\ p(z_0, z_1, \dots z_T) = p(z_T) \prod_{t=1}^{T} p(z_{t-1} \vert z_t),}` $$ - where we explicitly showed the marginalization over the multiple latents in [eq:BC-multi-latent-model-1], and used the law of conditional probability and Markov property in [eq:BC-multi-latent-model-2]. + where we explicitly showed the marginalization over the multiple latents in [BC-multi-latent-model-1], and used the law of conditional probability and Markov property in [BC-multi-latent-model-2]. -Just like VAEs, DMs attemp to learn to reproduce an underlying data distribution $p (o,a)$ given a collection of i.i.d. samples approximating the model posited to have generated the data in the first place ( [eq:BC-multi-latent-model-1]). Similarily to VAEs, DMs approximate the process of sampling from the unknown $p(o,a)$ (1) sampling from an easy-to-sample distribution (e.g., Gaussian) and (2) learning to reconstruct high-likelihood samples under the unknown distribution. 
However, in stark contrast with VAEs, the easy-to-sample distribution contains *no mutual information* regarding the data distribution $p(o,a)$. Crucially, as no information from the sample $(o,a)$ (denoted as $z_0 \equiv (o,a)$ for the sake of notation) is assumed to be propagated throughout the chain of latents, the posterior $q(z_t \vert z_{t-1})$ assumes a relatively amicable structure in DMs, reducing complexity. The *true* likelihood $p(z_{t-1} \vert z_t)$ is instead typically approximated using the parametrization $p_\theta (z_{t-1} \vert z_t)$. In that, the information contained in the unknwon data distribution is *reconstructed* via a process in which samples from a fixed distribution are turned into (ideally) high-likelihood samples under $p(o,a)$--a process referred to as *denoising*. +Just like VAEs, DMs attempt to learn to reproduce an underlying data distribution $p (o,a)$ given a collection of i.i.d. samples approximating the model posited to have generated the data in the first place ([BC-multi-latent-model-1]). Similarly to VAEs, DMs approximate the process of sampling from the unknown $p(o,a)$ by (1) sampling from an easy-to-sample distribution (e.g., Gaussian) and (2) learning to reconstruct high-likelihood samples under the unknown distribution. However, in stark contrast with VAEs, the easy-to-sample distribution contains *no mutual information* regarding the data distribution $p(o,a)$. Crucially, as no information from the sample $(o,a)$ (denoted as $z_0 \equiv (o,a)$ for the sake of notation) is assumed to be propagated throughout the chain of latents, the posterior $q(z_t \vert z_{t-1})$ assumes a relatively amenable structure in DMs, reducing complexity. The *true* likelihood $p(z_{t-1} \vert z_t)$ is instead typically approximated using the parametrization $p_\theta (z_{t-1} \vert z_t)$. 
In that, the information contained in the unknown data distribution is *reconstructed* via a process in which samples from a fixed distribution are turned into (ideally) high-likelihood samples under $p(o,a)$--a process referred to as *denoising*. Under such a model, we can express the log-likelihood of an arbitrary sample as[^4] $$ -`\log p_\theta (\underbrace{o,a}_{= z_0}) = \mathbb{E}_{z_1 \sim q(\bullet \vert z_0)} \log p_\theta (z_0 \vert z_1) -\\ \mathbb{E}_{z_{T-1} \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_T \vert z_{T-1}) \Vert p(z_T) ) \big] - \notag\\ \sum_{t=1}^{T-1} \mathbb{E}_{(z_{t-1}, z_{t+1}) \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_t \vert z_{t-1}) \Vert p_\theta(z_t \vert z_{t-1}) ) \big], \notag` +`\htmlId{diffusion-likelihood}{\log p_\theta (\underbrace{o,a}_{= z_0}) = \mathbb{E}_{z_1 \sim q(\bullet \vert z_0)} \log p_\theta (z_0 \vert z_1) -\\ \mathbb{E}_{z_{T-1} \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_T \vert z_{T-1}) \Vert p(z_T) ) \big] - \notag\\ \sum_{t=1}^{T-1} \mathbb{E}_{(z_{t-1}, z_{t+1}) \sim q(\bullet \vert z_0)} \big[ \text{D}_{\text{KL}}(q(z_t \vert z_{t-1}) \Vert p_\theta(z_t \vert z_{t+1}) ) \big], \notag}` $$ providing an optimization target in the form of $\max_\theta \log p_\theta (\mathcal D)$. -In their seminal work on using DMs for variational inference, @hoDenoisingDiffusionProbabilistic2020 introduce major contributions regarding solving $\min_\theta -\log p_\theta(o,a)$. In particular, @hoDenoisingDiffusionProbabilistic2020 exclusively adopt a fixed *Gaussian* posterior in the form of $q(z_t \vert z_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}z_{t-1}, \beta_t \mathbf I)$. The choice of adopting Gaussians has profound implications on the generative process modeled. 
Indeed, under the (mild) assumption that the variance is sufficiently small $\beta_t \leq \eta, \eta \in \mathbb R^+$, @sohl-dicksteinDeepUnsupervisedLearning2015 proved that the likelihood $p(z_{t-1} \vert z_t)$ is Gaussian as well, which allows for the particularly convenient parametrization of the approximate likelihood $p_\theta (x_{t-1} \vert x_t) = \mathcal N(\mu_\theta(x_t, t), \Sigma_\theta(x_t,t)), \ t \in [1,T]$, as well as for closed-form tractability of the KL-divergence terms in [eq:diffusion-likelihood]. Further, the posterior’s structure also enables an analytical description for the distribution of the $t$-th latent variable, $q(z_t \vert z_0) = \mathcal N (\sqrt{\bar{\alpha}_t}z_0, (1-\bar{\alpha}_t) \mathbf{I})$, with $\alpha_t = 1-\beta_t, \ \bar \alpha_t = \prod_{k=1}^t \alpha_k$, which conveniently prevents iterative posterior sampling. +In their seminal work on using DMs for variational inference, @hoDenoisingDiffusionProbabilistic2020 introduce major contributions regarding solving $\min_\theta -\log p_\theta(o,a)$. In particular, @hoDenoisingDiffusionProbabilistic2020 exclusively adopt a fixed *Gaussian* posterior in the form of $q(z_t \vert z_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}z_{t-1}, \beta_t \mathbf I)$. The choice of adopting Gaussians has profound implications for the generative process modeled. Indeed, under the (mild) assumption that the variance is sufficiently small $\beta_t \leq \eta, \eta \in \mathbb R^+$, @sohl-dicksteinDeepUnsupervisedLearning2015 proved that the likelihood $p(z_{t-1} \vert z_t)$ is Gaussian as well, which allows for the particularly convenient parametrization of the approximate likelihood $p_\theta (z_{t-1} \vert z_t) = \mathcal N(\mu_\theta(z_t, t), \Sigma_\theta(z_t,t)), \ t \in [1,T]$, as well as for closed-form tractability of the KL-divergence terms in [diffusion-likelihood]. 
Further, the posterior’s structure also enables an analytical description for the distribution of the $t$-th latent variable, $q(z_t \vert z_0) = \mathcal N (\sqrt{\bar{\alpha}_t}z_0, (1-\bar{\alpha}_t) \mathbf{I})$, with $\alpha_t = 1-\beta_t, \ \bar \alpha_t = \prod_{k=1}^t \alpha_k$, which conveniently prevents iterative posterior sampling. -Finally, adopting Gaussian posteriors permits a particularly pleasing interpretation of the dynamics of training DMs @permenterInterpretingImprovingDiffusion2024. By using Gaussian posteriors, the hierarchical latent variables effectively lose increasingly more information circa the original (unknown) distribution’s sample, $z_0$, increasingly distributing according to a standard Gaussian and thus containing no information at all (Figure 25). Figure 25 illustrates this procedure on a simplified, bidimensional observation-action distribution, where we considered $o=q_2$ and $a=q^h_2$, with $q_2$ representing the robot’s *elbow flex* actuation and $q^h_2$ the human teleoperator’s robot elbow flex. +Finally, adopting Gaussian posteriors permits a particularly pleasing interpretation of the dynamics of training DMs @permenterInterpretingImprovingDiffusion2024. By using Gaussian posteriors, the hierarchical latent variables effectively lose increasingly more information circa the original (unknown) distribution’s sample, $z_0$, increasingly distributing according to a standard Gaussian and thus containing no information at all (Figure 25). Figure 25 illustrates this procedure on a simplified, bidimensional observation-action distribution, where we considered $o=q_2$ and $a=q^h_2$, with $q_2$ representing the robot’s *elbow flex* actuation and $q^h_2$ the human teleoperator’s robot elbow flex. -Because the recorded behavior is teleoperated, measurements mostly distribute along the line $a = o + \eta, \eta \sim N(0,1)$, with $\eta$-variability accouting for minor control inconsistencies (Figure 26). 
Using Gaussian posteriors--i.e., adding Gaussian noise--effectively simulates a *Brownian motion* for the elements in the distribution’s support (in Figure 25, $\mathcal O\times \mathcal A$), whereby information *diffuses away* from the samples, and comparing the diffused samples to the original data points one can derive an estimate of the total displacement induced by diffusion. Under the only assumption that the likelihood of the diffused samples is low under the original unknown data distribution, then one can effectively approximate the unkwown distribution by learning to *reverse* such displacement. This key intuition allows to write a simplified training objective: $ \mathcal L(\theta) = \mathbb{E}_{t, z_0, \epsilon} \big[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} z_0 + \epsilon \sqrt{1 - \bar \alpha_t}, t) \Vert^2 \big], \quad t \sim \mathcal{U}(\{1,\dots,T\}), \quad z_0 \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}).$ +Because the recorded behavior is teleoperated, measurements mostly distribute along the line $a = o + \eta, \eta \sim N(0,1)$, with $\eta$-variability accounting for minor control inconsistencies (Figure 26). Using Gaussian posteriors--i.e., adding Gaussian noise--effectively simulates a *Brownian motion* for the elements in the distribution’s support (in Figure 25, $\mathcal O\times \mathcal A$), whereby information *diffuses away* from the samples; by comparing the diffused samples to the original data points, one can derive an estimate of the total displacement induced by diffusion. Under the sole assumption that the likelihood of the diffused samples is low under the original unknown data distribution, one can effectively approximate the unknown distribution by learning to *reverse* such displacement. 
This key intuition allows writing a simplified training objective: $\htmlId{diffusion-simplified-loss}{\mathcal L(\theta) = \mathbb{E}_{t, z_0, \epsilon} \big[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} z_0 + \epsilon \sqrt{1 - \bar \alpha_t}, t) \Vert^2 \big], \quad t \sim \mathcal{U}(\{1,\dots,T\}), \quad z_0 \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}).}$ -In this simplified (minimization) objective, the optimization process differs from [eq:diffusion-likelihood] in that, rather than maxizing $p_\theta$ directly, the parameters $\theta$ of the pairwise likelihood $p_\theta(z_{t-1} \vert z_t)$ are adjusted to *predict the total displacement* $\epsilon$ for a randomly long ($t \sim \mathcal{U}(\{1,\dots,T\}$ )) diffusion process starting from a sample of the target distribution. +In this simplified (minimization) objective, the optimization process differs from [diffusion-likelihood] in that, rather than maximizing $p_\theta$ directly, the parameters $\theta$ of the pairwise likelihood $p_\theta(z_{t-1} \vert z_t)$ are adjusted to *predict the total displacement* $\epsilon$ for a randomly long ($t \sim \mathcal{U}(\{1,\dots,T\})$) diffusion process starting from a sample of the target distribution. -By learning the total displacement from a generally, uninformative corrupted sample obtained diffusing information and a sample from an unknown distribution--significant ($\Vert \epsilon \Vert > 0$) whenever input and target distribution are sufficiently different-- @hoDenoisingDiffusionProbabilistic2020 show that one can approximate the underlying distribution reversing the displacement, *denoising* samples. 
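
A single term of the simplified objective can be sketched in plain Python as follows. This is an illustrative sketch rather than the `lerobot` implementation: the linear noise schedule, the horizon `T = 1000`, and the placeholder `zero_eps_model` (a hypothetical, untrained noise regressor that always predicts zero) are assumptions made only for the example. The sketch relies on the closed form $q(z_t \vert z_0) = \mathcal N (\sqrt{\bar{\alpha}_t}z_0, (1-\bar{\alpha}_t) \mathbf{I})$ to noise a sample in one shot instead of iterating over the chain.

```python
import math
import random


def make_alpha_bars(betas: list) -> list:
    """Cumulative products alpha_bar_t = prod_{k<=t} (1 - beta_k), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(z0: list, alpha_bar_t: float, rng: random.Random):
    """Draw z_t ~ q(z_t | z_0) in closed form; returns (z_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    z_t = [scale_signal * z + scale_noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def simplified_loss_term(eps_model, z0: list, alpha_bars: list, rng: random.Random) -> float:
    """One Monte Carlo term of the simplified loss: ||eps - eps_model(z_t, t)||^2."""
    t = rng.randint(1, len(alpha_bars))  # t ~ U({1, ..., T}); randint is inclusive
    z_t, eps = noise_sample(z0, alpha_bars[t - 1], rng)
    eps_hat = eps_model(z_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


def zero_eps_model(z_t: list, t: int) -> list:
    """Hypothetical untrained regressor, used only to make the sketch runnable."""
    return [0.0] * len(z_t)


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * k / (T - 1) for k in range(T)]  # assumed linear schedule
alpha_bars = make_alpha_bars(betas)

rng = random.Random(0)
z0 = [0.3, 0.35]  # toy (o, a) pair, e.g. robot and teleoperator elbow flex
loss = simplified_loss_term(zero_eps_model, z0, alpha_bars, rng)
print(f"alpha_bar_T = {alpha_bars[-1]:.2e}, loss term = {loss:.3f}")
```

With this schedule $\bar\alpha_T$ is close to zero, so $z_T$ is nearly a standard Gaussian sample carrying almost no information about $z_0$. In practice the placeholder would be replaced by a neural network trained by averaging many such terms over the dataset.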
Interestingly, under the hypothesis real-world data belongs to a single higher dimensional manifold (Manifold Hypothesis), @permenterInterpretingImprovingDiffusion2024 show that diffusion learns the gradient of a distance function from any off-point manifold (such as perturbed, uniformative samples), and the data manifold itself. Following this gradient--i.e., denoising a sample from an uninformative distribution--corresponds to projecting back into the manifold, yielding a procedure to sample from unknown distributions by means of Euclidean projection. Indeed, under the assumption that $p_\theta (z_{t-1} \vert z_t)$ is Gaussian, then sampling $z_{t-1} \sim p_\theta(\bullet \vert z_{t})$ corresponds to computing $z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \epsilon_\theta(z_t, t) \right) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal N(\mathbf{0}, \mathbf{I}), $ thus showing that the lower-level latent variables in a DM can be obtained by iteratively removing noise from the one-step higher order variable, using the noise regressor $\epsilon_\theta(z_t, t)$ learned minimizing [eq:diffusion-simplified-loss]. +By learning the total displacement between a generally uninformative corrupted sample, obtained by diffusing information, and a sample from an unknown distribution--significant ($\Vert \epsilon \Vert > 0$) whenever input and target distribution are sufficiently different-- @hoDenoisingDiffusionProbabilistic2020 show that one can approximate the underlying distribution by reversing the displacement, *denoising* samples. Interestingly, under the hypothesis that real-world data belongs to a single low-dimensional manifold embedded in a higher-dimensional space (Manifold Hypothesis), @permenterInterpretingImprovingDiffusion2024 show that diffusion learns the gradient of a distance function between any off-manifold point (such as perturbed, uninformative samples) and the data manifold itself. 
Following this gradient--i.e., denoising a sample from an uninformative distribution--corresponds to projecting back into the manifold, yielding a procedure to sample from unknown distributions by means of Euclidean projection. Indeed, under the assumption that $p_\theta (z_{t-1} \vert z_t)$ is Gaussian, then sampling $z_{t-1} \sim p_\theta(\bullet \vert z_{t})$ corresponds to computing $\htmlId{diffusion-denoising-definition}{z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \epsilon_\theta(z_t, t) \right) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal N(\mathbf{0}, \mathbf{I}),}$ thus showing that the lower-level latent variables in a DM can be obtained by iteratively removing noise from the one-step higher order variable, using the noise regressor $\epsilon_\theta(z_t, t)$ learned minimizing [diffusion-simplified-loss]. #### Flow Matching @@ -890,10 +866,7 @@ $$ FM proved very effective in a variety of applications, ranging from image @esserScalingRectifiedFlow2024 and video generation @polyakMovieGenCast2025 to robotics control @blackp0VisionLanguageActionFlow2024. Most notably, in their introductory work on FM for GM, @lipmanFlowMatchingGenerative2023 show how DMs can be seen as a specific instance of FM where the *conditional* target vector field $u$ approximated by the noise regressor corresponds to ``` math -\begin{equation} - - u(t, z\vert z_0) = \frac{\frac{d}{dt}\alpha(1-t)}{1 - (\alpha(1-t))^2}(\alpha(1-t)z - z_0), \quad \alpha(t) = e^{-\frac12 \int_0^t \beta(s) ds}, \quad \forall z_0 \in \mathcal D -\end{equation} +\htmlId{fm-diffusion-vector-field}{u(t, z\vert z_0) = \frac{\frac{d}{dt}\alpha(1-t)}{1 - (\alpha(1-t))^2}(\alpha(1-t)z - z_0), \quad \alpha(t) = e^{-\frac12 \int_0^t \beta(s) ds}, \quad \forall z_0 \in \mathcal D} ``` Note that the traditional discrete-time noise-scheduler ${\beta_t}_{t=0}^T$ is now generalized to a continuous map $\beta : [0,1] \mapsto \mathbb R^+$. 
Crucially, @lipmanFlowMatchingGenerative2023 prove that by exclusively optimizing the vector field for individual data points $z_0 \in \mathcal D$ individually, one also retrieves the optimal flow to morph the entire support of the initial distribution $p_0$ into $p_1 \ \text{s.t.} \mathcal D \sim p_1$. @@ -901,33 +874,33 @@ Note that the traditional discrete-time noise-scheduler ${\beta_t}_{t=0}^T$ is n src={ch4_normalizing_flows} zoomable downloadable - id="fig:ch4-normalizing-flows" + id="ch4-normalizing-flows" layout="fixed" alt="Probability distributions can be modified applying vector fields resulting in a flow of mass in the ..." caption={'Probability distributions can be modified applying vector fields resulting in a flow of mass in the support. When acting over time, vector fields can effectively change the distribution’s structure.'} /> -While the noising schedule of DMs results in a stochastic process that resembles a random walk, FM allows for more general--potentially, deterministic--likelihood and posterior parametrization. In the FM literature the likelihood and posterior probabilty densities defined along a HMLV model are typically jointly referred to as a *probability path*, where the distributions for successive adjacent transitions in the HMLV model are related by the (normalized) flow between them (Figure 27). The inherent flexibility of FM is one of their key advantages over DMs, as it opens up the possibility of *learning* more efficient paths. For instance, one can design probability paths inspired by Optimal Transport (OT)--a subdiscipline studying the problem of finding the most efficient way to morph one probability distribution into another. Probability paths obtained through OT paths tend to be *straighter* than diffusion paths (Figure 28), which can lead to faster and more stable training, as well as higher-quality sample generation with fewer steps at inference time. 
By avoiding unnecessary backtracking associated with the inherent stochastic nature of both the noising and denoising process in DMs, test-time compute is typically significantly reduced, while retaining comparable results @lipmanFlowMatchingGenerative2023. +While the noising schedule of DMs results in a stochastic process that resembles a random walk, FM allows for more general--potentially, deterministic--likelihood and posterior parametrization. In the FM literature the likelihood and posterior probability densities defined along an HMLV model are typically jointly referred to as a *probability path*, where the distributions for successive adjacent transitions in the HMLV model are related by the (normalized) flow between them (Figure 27). The inherent flexibility of FM is one of its key advantages over DMs, as it opens up the possibility of *learning* more efficient paths. For instance, one can design probability paths inspired by Optimal Transport (OT)--a subdiscipline studying the problem of finding the most efficient way to morph one probability distribution into another. Probability paths obtained through OT tend to be *straighter* than diffusion paths (Figure 28), which can lead to faster and more stable training, as well as higher-quality sample generation with fewer steps at inference time. By avoiding unnecessary backtracking associated with the inherent stochastic nature of both the noising and denoising process in DMs, test-time compute is typically significantly reduced, while retaining comparable results @lipmanFlowMatchingGenerative2023. -In practice, FM can be applied to generative modeling by learning a vector field regressor $v_\theta(z, t)$ to approximate a given target vector field $u(t, z)$. In the particular case of DMs, $u(t, z)$ is defined as in [eq:fm-diffusion-vector-field], while in principle the target vector field can be learned to induce a particular transportation, or fixed according to OT.
Given a sample from the data distribution $z_1 \sim p_1$ and a sample from an easy-to-sample prior $z_0 \sim p_0$, CFM defines a simple path between them using *linear interpolation* between samples $z_t = (1-t)z_0 + t z_1$, resulting in the target vector field $u(t, z_t) = z_1 - z_0$. Then, an FM model can be trained with the simple regression objective defined as $ \mathcal L(\theta) = \mathbb{E}_{t, z_0, z_1} \big[ \Vert v_\theta((1-t)z_0 + t z_1, t) - (z_1 - z_0) \Vert^2 \big], \quad t \sim \mathcal{U}([0,1]),$ where $z_0 \sim p_0(\bullet)$ and $z_1 \sim p_1(\bullet)$. Note how in [eq:flow-matching-objective]--differently from [eq:diffusion-simplified-loss]--time is assumed to be varying continuously $t \sim \mathcal U([0,1])$ rather than discretely $t \sim \mathcal U(\{1,\dots,T\})$, a key property of flow-based models. The objective in [eq:flow-matching-objective] directly regresses the learned vector field onto the simple, straight path connecting a point from the prior and a point from the data, providing a simulation-free training procedure that is both stable and efficient. At inference time, samples are generated by starting with $z_0 \sim p_0$ and iteratively refining it according to $\frac{dz}{dt} = v_\theta(z_t, t)$ for $t \in [0,1]$--an operation that can be numerically carried out with standard ODE solvers. +In practice, FM can be applied to generative modeling by learning a vector field regressor $v_\theta(z, t)$ to approximate a given target vector field $u(t, z)$. In the particular case of DMs, $u(t, z)$ is defined as in [fm-diffusion-vector-field], while in principle the target vector field can be learned to induce a particular transportation, or fixed according to OT. Given a sample from the data distribution $z_1 \sim p_1$ and a sample from an easy-to-sample prior $z_0 \sim p_0$, CFM defines a simple path between them using *linear interpolation* between samples $z_t = (1-t)z_0 + t z_1$, resulting in the target vector field $u(t, z_t) = z_1 - z_0$.
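As a minimal sketch of the linear-interpolation path, the constant target field $z_1 - z_0$, and the squared-error regression described above, the snippet below uses plain Python lists in place of tensors. All function names (`interpolate`, `target_velocity`, `cfm_loss`, `euler_sample`) are hypothetical and are not part of `lerobot`; the explicit Euler integrator is only one of many possible ODE solvers.

```python
import random


def interpolate(z0, z1, t):
    """Point z_t = (1 - t) * z0 + t * z1 on the straight path from z0 to z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Target vector field u(t, z_t) = z1 - z0, constant along the path."""
    return [b - a for a, b in zip(z0, z1)]


def cfm_loss(v_theta, pairs, rng):
    """Monte Carlo estimate of E ||v_theta(z_t, t) - (z1 - z0)||^2 with t ~ U([0, 1])."""
    total = 0.0
    for z0, z1 in pairs:
        t = rng.random()  # continuous time, sampled uniformly in [0, 1)
        z_t = interpolate(z0, z1, t)
        pred = v_theta(z_t, t)
        u = target_velocity(z0, z1)
        total += sum((p - q) ** 2 for p, q in zip(pred, u))
    return total / len(pairs)


def euler_sample(v_theta, z0, n_steps=10):
    """Integrate dz/dt = v_theta(z, t) from t=0 to t=1 with explicit Euler steps."""
    z, dt = list(z0), 1.0 / n_steps
    for k in range(n_steps):
        v = v_theta(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy check with a hand-written "oracle" field for a single (z0, z1) pair.
z0, z1 = [0.0, 0.0], [1.0, 2.0]
oracle = lambda z, t: [1.0, 2.0]  # equals z1 - z0 for this pair
loss = cfm_loss(oracle, [(z0, z1)], random.Random(0))
sample = euler_sample(oracle, z0)
```

With the oracle field the regression error vanishes and Euler integration carries the prior sample onto the data sample, which is the behavior a trained $v_\theta$ is meant to approximate.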
Then, an FM model can be trained with the simple regression objective defined as $\htmlId{flow-matching-objective}{\mathcal L(\theta) = \mathbb{E}_{t, z_0, z_1} \big[ \Vert v_\theta((1-t)z_0 + t z_1, t) - (z_1 - z_0) \Vert^2 \big], \quad t \sim \mathcal{U}([0,1]),}$ where $z_0 \sim p_0(\bullet)$ and $z_1 \sim p_1(\bullet)$. Note how in [flow-matching-objective]--differently from [diffusion-simplified-loss]--time is assumed to be varying continuously $t \sim \mathcal U([0,1])$ rather than discretely $t \sim \mathcal U(\{1,\dots,T\})$, a key property of flow-based models. The objective in [flow-matching-objective] directly regresses the learned vector field onto the simple, straight path connecting a point from the prior and a point from the data, providing a simulation-free training procedure that is both stable and efficient. At inference time, samples are generated by starting with $z_0 \sim p_0$ and iteratively refining it according to $\frac{dz}{dt} = v_\theta(z_t, t)$ for $t \in [0,1]$--an operation that can be numerically carried out with standard ODE solvers. ### Action Chunking with Transformers -While GMs prove useful in learning complex, high-dimensional multi-modal distributions, they do not natively address the compounding errors problem characteristic of online, sequential predictions. In Action Chunking with Transformers (ACT), @zhaoLearningFineGrainedBimanual2023 present an application of VAEs to the problem of learning purely from offline trajectories, and introduce a simple, yet effective method to mitigate error compounding, learning high-fidelity autonomous behaviors. Drawing inspiration from how humans plan to atomically enact sequences of the kind $a_{t:t+k}$ instead of single actions $a_t$, @zhaoLearningFineGrainedBimanual2023 propose learning a GM on a dataset of input demonstrations by modeling *action chunks*.
Besides contributions to learning high-performance autonomous behaviors, @zhaoLearningFineGrainedBimanual2023 also introduce hardware contributions in the form of a low-cost bimanual robot setup (ALOHA) capable of performing fine-grained manipulation tasks, such as opening a lid, slotting a battery in its allotment or even preparing tape for application. +While GMs prove useful in learning complex, high-dimensional multi-modal distributions, they do not natively address the compounding errors problem characteristic of online, sequential predictions. In Action Chunking with Transformers (ACT), @zhaoLearningFineGrainedBimanual2023 present an application of VAEs to the problem of learning purely from offline trajectories, and introduce a simple, yet effective method to mitigate error compounding, learning high-fidelity autonomous behaviors. Drawing inspiration from how humans plan to atomically enact sequences of the kind $a_{t:t+k}$ instead of single actions $a_t$, @zhaoLearningFineGrainedBimanual2023 propose learning a GM on a dataset of input demonstrations by modeling *action chunks*. Besides contributions to learning high-performance autonomous behaviors, @zhaoLearningFineGrainedBimanual2023 also introduce hardware contributions in the form of a low-cost bimanual robot setup (ALOHA) capable of performing fine-grained manipulation tasks, such as opening a lid, slotting a battery in its allotment or even preparing tape for application. On the robot learning side of their contributions, @zhaoLearningFineGrainedBimanual2023 adopt transformers as the architectural backbone to learn a *Conditional* VAE @sohnLearningStructuredOutput2015. Conditional VAEs are a variation of the more standard VAE formulation introducing a conditioning variable on sampling from the latent prior, allowing the modeling of *one-to-many* relationships between latent and data samples.
Further, in stark contrast with previous work @florenceImplicitBehavioralCloning2022, @jannerPlanningDiffusionFlexible2022, @zhaoLearningFineGrainedBimanual2023 do not learn a full joint $p_\theta(o,a)$ over observations and actions. While the *policy* distribution $p_\theta(a \vert o)$ can in principle be entirely described from its joint $p_\theta(o,a)$, it is often the case that the conditional distribution is intractable when using function approximators, as $p_\theta(a \vert o) = \tfrac{p_\theta(o,a)}{\int_{\mathcal A} p_\theta(o,a)\, da}$ and the integral in the denominator is typically intractable. Instead of modeling the full joint using a vanilla VAE, @zhaoLearningFineGrainedBimanual2023 propose learning a *conditional* VAE @sohnLearningStructuredOutput2015 modeling the policy distribution directly $p (a \vert o)$. -In practice, when learning from demonstrations adopting CVAEs results in a slight modification to the VAE objective in [eq:ELBO], which is adapted to $ \text{ELBO}_{\mathcal D}(\theta, \phi, \omega) = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim q_\phi(\cdot \vert o_i, a_i)} \big[ \log p_\theta(a_i \vert z, o_i) \big] - \text{D}_{\text{KL}}\big[ q_\phi(z \vert o_i, a_i) \Vert p_\omega(z \vert o_i) \big] \right)$. Notice how in [eq:c-ELBO] we are now also learning a new set of parameters $\omega$ for the prior distribution in the latent space. Effectively, this enables conditioning latent-space sampling (and thus reconstruction) during training, and potentially inference, proving useful when learning inherently conditional distributions like policies. Further, ACT is trained as a $\beta$-CVAE @higgins2017beta, using the weight $\beta$ of the KL regularization term in [eq:c-ELBO] as a hyperparameter regulating the information condensed in the latent space, where higher $\beta$ results in a less expressive latent space.
 +In practice, when learning from demonstrations adopting CVAEs results in a slight modification to the VAE objective in [ELBO], which is adapted to $\htmlId{c-ELBO}{\text{ELBO}_{\mathcal D}(\theta, \phi, \omega) = \sum_{i=0}^{N} \left( \mathbb{E}_{z \sim q_\phi(\cdot \vert o_i, a_i)} \big[ \log p_\theta(a_i \vert z, o_i) \big] - \text{D}_{\text{KL}}\big[ q_\phi(z \vert o_i, a_i) \Vert p_\omega(z \vert o_i) \big] \right)}$. Notice how in [c-ELBO] we are now also learning a new set of parameters $\omega$ for the prior distribution in the latent space. Effectively, this enables conditioning latent-space sampling (and thus reconstruction) during training, and potentially inference, proving useful when learning inherently conditional distributions like policies. Further, ACT is trained as a $\beta$-CVAE @higgins2017beta, using the weight $\beta$ of the KL regularization term in [c-ELBO] as a hyperparameter regulating the information condensed in the latent space, where higher $\beta$ results in a less expressive latent space. In their work, @zhaoLearningFineGrainedBimanual2023 ablated using a GM to learn from human demonstrations compared to a simpler, supervised objective, $\mathcal L_1(a,a^\prime) = \Vert a - a^\prime \Vert_1$. Interestingly, they found the performance of these two approaches to be comparable when learning from *scripted* demonstrations. That is, when learning from data collected rolling out a predetermined set of commands $[q^c_0, q^c_1, \dots]$, GM did *not* provide an advantage over standard supervised learning.
However, when learning from human demonstrations--i.e., from data collected executing commands coming from a human controller $[q^h_0, q^h_1, \dots]$--they found performance (success rate on a downstream task) to be severely (-33.3%) hindered by adopting a standard supervised learning objective compared to a richer, potentially more complex to learn variational objective, in keeping with the multimodal nature of human demonstrations data and findings presented in @florenceImplicitBehavioralCloning2022. The authors also ablate the action chunking paradigm, reporting significant performance gains for performing action chunking (1% vs. 44% success rate). To avoid acting open-loop, @zhaoLearningFineGrainedBimanual2023 design an inference process that performs inference at every timestep $t$ and then aggregates overlapping chunks using an exponential moving average. @@ -935,19 +908,19 @@ In their work, @zhaoLearningFineGrainedBimanual2023 ablated using a GM to learn src={ch4_act} zoomable downloadable - id="fig:ch4-act" + id="ch4-act" layout="fixed" alt="Action Chunking with Transformer (ACT), as in @zhaoLearningFineGrainedBimanual2023. ACT introduces a..." caption={'Action Chunking with Transformer (ACT), as in @zhaoLearningFineGrainedBimanual2023. ACT introduces an action chunking paradigm to cope with high-dimensional multi-modal demonstration data, and a transformer-based CVAE architecture.'} /> -In ACT (Figure 29), inference for a given observation $o \in \mathcal O$ could be performed by (1) computing a prior $p_\omega(z \vert o)$ for the latent and (2) decoding an action chunk from a sampled latent $z \sim p_\omega(\bullet \vert o)$, similarly to how standard VAEs generate samples, with the exception that vanilla VAEs typically set $p(z\vert o) \equiv p(z) \sim \mathcal N(\mathbf{0}, \mathbf{I})$ and thus skip (1).
 +In ACT (Figure 29), inference for a given observation $o \in \mathcal O$ could be performed by (1) computing a prior $p_\omega(z \vert o)$ for the latent and (2) decoding an action chunk from a sampled latent $z \sim p_\omega(\bullet \vert o)$, similarly to how standard VAEs generate samples, with the exception that vanilla VAEs typically set $p(z\vert o) \equiv p(z) \sim \mathcal N(\mathbf{0}, \mathbf{I})$ and thus skip (1). [eq:diffusion-simplified-loss] on a stack of $T_o$ observations, resulting in the *conditional* simplified diffusion objective +In practice, conditioning on observation data is achieved by conditioning the added noise regressor $\epsilon_\theta$ introduced in [diffusion-simplified-loss] on a stack of $T_o$ observations, resulting in the *conditional* simplified diffusion objective $$ -`\mathcal L(\theta) = \mathbb{E}_{t, a_{t:t+H_a}, \epsilon} \big[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} a_{t:t+T_a} + \epsilon \sqrt{1 - \bar \alpha_t}, t, o_{t-T_o:t}) \Vert^2 \big],\\ t \sim \mathcal{U}(\{1,\dots,T\}), \quad a_{t:t+T_a}, o_{t-T_o:t} \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}). \notag` +`\htmlId{diffusion-policy-objective}{\mathcal L(\theta) = \mathbb{E}_{t, a_{t:t+H_a}, \epsilon} \big[ \Vert \epsilon - \epsilon_\theta(\sqrt{\bar \alpha_t} a_{t:t+T_a} + \epsilon \sqrt{1 - \bar \alpha_t}, t, o_{t-T_o:t}) \Vert^2 \big],\\ t \sim \mathcal{U}(\{1,\dots,T\}), \quad a_{t:t+T_a}, o_{t-T_o:t} \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}). \notag}` $$ - Notice how in [eq:diffusion-policy-objective] the noise regressor is conditioned both on the latent variable rank $t$ *and* on a stack of previous observations $o_{t-T_o:t}$.
@chiDiffusionPolicyVisuomotor2024 claim the combination of (1) conditioning on a horizon of previous observations and (2) predicting multiple actions into the future allows DP to *commit to specific modes* in the data at inference time, which proves essential for good performance and avoiding indecisiveness. + Notice how in [diffusion-policy-objective] the noise regressor is conditioned both on the latent variable rank $t$ *and* on a stack of previous observations $o_{t-T_o:t}$. @chiDiffusionPolicyVisuomotor2024 claim the combination of (1) conditioning on a horizon of previous observations and (2) predicting multiple actions into the future allows DP to *commit to specific modes* in the data at inference time, which proves essential for good performance and avoiding indecisiveness. -Figure 32 shows the convolution-based version of the architecture proposed by @chiDiffusionPolicyVisuomotor2024, illustrating inference on a single sample from $\mathcal D$ for simplicity. An arbitrarily noisy chunk of $H_a$ actions $\tilde a_{t:t+H_a}$ is mapped to a learned high-dimensional space. Similarly, both image observations and poses are embedded before being aggregated with the action embeddings. Then, a U-Net @ronnebergerUNetConvolutionalNetworks2015 is trained to regress the noise added into $\tilde a_{t:t+H_a}$, using observation conditioning information at every layer and seeking to optimize [eq:diffusion-policy-objective]. At inference time, the noise predictor is used to predict the quantity of noise at every $t \in [T, \dots, 0 ]$ and iteratively subtract it from $\tilde a_{t:t+T_a}$, reversing the diffusion process simulated in training conditioned on $o_{t-T_o:t}$ to predict $a_{t:t+T_a}$. +Figure 32 shows the convolution-based version of the architecture proposed by @chiDiffusionPolicyVisuomotor2024, illustrating inference on a single sample from $\mathcal D$ for simplicity.
An arbitrarily noisy chunk of $H_a$ actions $\tilde a_{t:t+H_a}$ is mapped to a learned high-dimensional space. Similarly, both image observations and poses are embedded before being aggregated with the action embeddings. Then, a U-Net @ronnebergerUNetConvolutionalNetworks2015 is trained to regress the noise added into $\tilde a_{t:t+H_a}$, using observation conditioning information at every layer and seeking to optimize [diffusion-policy-objective]. At inference time, the noise predictor is used to predict the quantity of noise at every $t \in [T, \dots, 0 ]$ and iteratively subtract it from $\tilde a_{t:t+T_a}$, reversing the diffusion process simulated in training conditioned on $o_{t-T_o:t}$ to predict $a_{t:t+T_a}$. -Trained on 50-150 demos (15-60 minutes of teleoperation data), DP achieves strong performance on a variety of simulated and real-world tasks, including dexterous and deformable manipulation tasks such as sauce pouring and mat unrolling. Notably, the authors ablated the relevance of using RGB camera streams as input to their policy, and observed how high frame-rate visual observations can be used to attain performance (measured as success rate) comparable to that of state-based policies, typically trained in simulation with privileged information not directly available in real-world deployments. As high-frame-rate RGB inputs naturally accommodate dynamic, fast-changing environments, @chiDiffusionPolicyVisuomotor2024’s conclusion offers significant evidence for learning streamlined control policies directly from pixels.
Further, to accelerate inference, @chiDiffusionPolicyVisuomotor2024 employ Denoising Diffusion Implicit Models @songDenoisingDiffusionImplicit2022, a variant of Denoising Diffusion Probabilistic Models @hoDenoisingDiffusionProbabilistic2020 (DDPM) adopting a strictly deterministic denoising paradigm (differently from DDPM’s natively stochastic one) inducing the same final distribution’s as DDPM’s, and yet resulting in 10 times less denoising steps at inference time @chiDiffusionPolicyVisuomotor2024. Across a range of simulated and real-world tasks, @chiDiffusionPolicyVisuomotor2024 find DPs particularly performant when implementing a transformer-based network as $\epsilon_\theta$, although the authors note the increased sensitivity of transformer networks to hyperparameters and thus explicitly recommend starting out with a simpler, convolution-based architecture for diffusion (Figure 32), which are however reported to be biased towards learning low-frequency components @tancikFourierFeaturesLet2020 and thus may prove more challenging to train with non-smooth action sequences. +Training using 50-150 demos (15-60 minutes of teleoperation data) DP achieves strong performance on a variety of simulated and real-world tasks, including dexterous and deformable manipulation tasks such as sauce pouring and mat unrolling. Notably, the authors ablated the relevance of using RGB camera streams as input to their policy, and observed how high frame-rate visual observations can be used to attain performance (measured as success rate) comparable to that of state-based policies, typically trained in simulation with priviledged information not directly available in real-world deployments. As high-frame rate RGB inputs naturally accomodate for dynamic, fast changing environments, @chiDiffusionPolicyVisuomotor2024’s conclusion offers significant evidence for learning streamlined control policies directly from pixels. 
In their work, @chiDiffusionPolicyVisuomotor2024 also compare DP against their baseline across dataset sizes, showing that DP outperforms the considered baseline for every dataset size considered. Further, to accelerate inference, @chiDiffusionPolicyVisuomotor2024 employ Denoising Diffusion Implicit Models @songDenoisingDiffusionImplicit2022, a variant of Denoising Diffusion Probabilistic Models @hoDenoisingDiffusionProbabilistic2020 (DDPM) adopting a strictly deterministic denoising paradigm (differently from DDPM’s natively stochastic one) inducing the same final distribution as DDPM, and yet resulting in 10 times fewer denoising steps at inference time @chiDiffusionPolicyVisuomotor2024. Across a range of simulated and real-world tasks, @chiDiffusionPolicyVisuomotor2024 find DPs particularly performant when implementing a transformer-based network as $\epsilon_\theta$, although the authors note the increased sensitivity of transformer networks to hyperparameters and thus explicitly recommend starting out with a simpler, convolution-based architecture for diffusion (Figure 32), which is however reported to be biased towards learning low-frequency components @tancikFourierFeaturesLet2020 and thus may prove more challenging to train with non-smooth action sequences. #### Code Example: Learning Diffusion Policies @@ -1003,9 +976,9 @@ Typically, the robot executes the entire action chunk $\mathbf{A}_t$, before a n A less resource-intensive approach is to entirely exhaust the chunk $\mathbf{A}$ before predicting a new chunk of actions, a strategy we refer to as *synchronous* (sync) inference. Sync inference efficiently allocates computation every $H_a$ timesteps, resulting in a reduced average computational burden at control time. In contrast, it inherently hinders the responsiveness of robot systems, introducing blind lags due to the robot being *idle* while computing $\mathbf{A}$.
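To make the cost of sync inference concrete, the following back-of-the-envelope sketch computes the share of wall-clock time a robot spends idle when it fully drains a chunk of $H_a$ actions and then waits for the next one. It assumes a constant inference latency and a fixed control period; the function names and the numeric values are illustrative assumptions, not measurements or `lerobot` APIs.

```python
def sync_idle_fraction(chunk_size, control_dt, inference_latency):
    """Fraction of wall-clock time spent idle under sync inference.

    One cycle = executing `chunk_size` actions (chunk_size * control_dt seconds)
    followed by waiting `inference_latency` seconds for the next chunk.
    """
    acting_time = chunk_size * control_dt
    return inference_latency / (acting_time + inference_latency)


def sync_forward_passes_per_second(chunk_size, control_dt, inference_latency):
    """Average number of policy forward passes per second under sync inference."""
    return 1.0 / (chunk_size * control_dt + inference_latency)


# Hypothetical numbers: 10 actions per chunk, 100 ms control period, 1 s latency.
idle = sync_idle_fraction(chunk_size=10, control_dt=0.1, inference_latency=1.0)
rate = sync_forward_passes_per_second(chunk_size=10, control_dt=0.1, inference_latency=1.0)
```

Under these assumed values the robot acts for one second and then waits for one second, so larger chunks amortize the latency at the price of acting open-loop for longer.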
-We directly address the lack of adaptiveness of robot systems due to acting open-loop, and the presence of lags at runtime by decoupling action chunk prediction $\mathbf{A}$ from action execution $a_t \gets \text{PopFront}(\mathbf{A}_t)$, developing an *asynchronous* (async) inference stack ([alg:async-inference]), whereby a $\text{RobotClient}$ sends an observation $o_t$ to a $\text{PolicyServer}$, receiving an action chunk $\mathbf{A}_t$ once inference is complete (Figure 33). In this way, we avoid execution lags by triggering chunk prediction while the control loop is still consuming a previously available queue, aggregating it with the newly incoming queue whenever available. In turn, async-inference tightens the loop between action prediction and action execution, by increasing the frequency at which observations are processed for chunk prediction. Crucially, decoupling action prediction from action execution also directly allows allocating more computational resources on a remote policy server sending actions to the robot client over networks, something which may prove very effective in resource-constrained scenarios such as low-power robots. +We directly address the lack of adaptiveness of robot systems due to acting open-loop, and the presence of lags at runtime by decoupling action chunk prediction $\mathbf{A}$ from action execution $a_t \gets \text{PopFront}(\mathbf{A}_t)$, developing an *asynchronous* (async) inference stack ([alg-async-inference]), whereby a $\text{RobotClient}$ sends an observation $o_t$ to a $\text{PolicyServer}$, receiving an action chunk $\mathbf{A}_t$ once inference is complete (Figure 33). In this way, we avoid execution lags by triggering chunk prediction while the control loop is still consuming a previously available queue, aggregating it with the newly incoming queue whenever available.
In turn, async-inference tightens the loop between action prediction and action execution, by increasing the frequency at which observations are processed for chunk prediction. Crucially, decoupling action prediction from action execution also directly allows allocating more computational resources on a remote policy server sending actions to the robot client over networks, something which may prove very effective in resource-constrained scenarios such as low-power robots. -
+
- +
@@ -1037,7 +1010,7 @@ Algorithmically, we attain (1) on the -side by consuming actions from a readily Interestingly, the behavior of async inference can be studied analytically. First, let $\ell$ be a random variable modeling the time needed to receive an action chunk $\mathbf{A}$ after sending an observation $o$, i.e. the sum of (1) the time to send the observation $o$ from the $\text{RobotClient}$ to the $\text{PolicyServer}$, $t_{C \to S}$, (2) the inference latency on the $\text{PolicyServer}$, $\ell_S$, and (3) the time to send $\mathbf{A}$ from the $\text{PolicyServer}$ to the $\text{RobotClient}$, $t_{S \to C}$. Assuming independence, $\mathbb E [\ell] = \mathbb E[t_{C \to S}] + \mathbb E[\ell_S] + \mathbb E[t_{S \to C}]$ which can be further simplified to $\mathbb E[\ell] \simeq \mathbb E[\ell_S]$, assuming communication time is (1) equal in both directions and (2) negligible with respect to the inference latency. Second, let $\Delta t$ be the environment’s control cycle. With a real-world frame-rate of 30 frames per second, $\Delta t=33\text{ms}$. Consequently, exhausted queues at runtime--i.e., being idle while awaiting a new chunk--are avoided for $g \geq \frac{\mathbb E[\ell_S] / \Delta t}{H_a}$. In this, the queue threshold $g$ plays a major role in the availability of actions to the $\text{RobotClient}$. -Figure 34 illustrates how the size of the action chunk $\lvert \mathbf{A}_t \rvert$ evolves over time for three representative values of $g$, detailing the following key scenarios: +Figure 34 illustrates how the size of the action chunk $\lvert \mathbf{A}_t \rvert$ evolves over time for three representative values of $g$, detailing the following key scenarios: - **Sequential limit $(g=0)$.** The client drains the entire chunk before forwarding a new observation to the server. During the round-trip latency needed to compute the next chunk, the queue is empty, leaving the robot *incapable of acting*. This reproduces the behavior of a fully sequential deployment and results in an average of $\mathbb E[\ell_S]$ idle seconds.
@@ -1045,7 +1018,7 @@ Interestingly, the behavior of async inference can be studied analytically. Firs - **Compute-intensive limit $(g=1)$.** As an extreme case, and in keeping with @zhaoLearningFineGrainedBimanual2023, an observation is sent at *every* timestep. The queue is therefore almost always filled, with only a minor saw-tooth due to $\Delta t/\mathbb E[\ell_S] < 1$. While maximally reactive, this setting incurs one forward pass per control tick and can prove prohibitively expensive on limited hardware. Importantly, because the client is consuming actions while the server computes the next chunk, the available queue never gets filled again. -
+
Action queue size evolution at runtime for various levels of $g$ when (A) not filtering out observations based on joint-space similarity and (B) filtering out near-duplicate observations, measuring their similarity in joint-space.
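The regimes listed above can be reproduced qualitatively with a toy, tick-based simulation of the action queue. This is a simplified sketch under explicit assumptions, not the actual async inference stack: latency is a constant number of control ticks, at most one request is in flight at a time (so $g=1$ is only approximated), the observation similarity filter is omitted, and an incoming chunk simply overrides the queue after dropping the actions already executed while it was being computed. The name `simulate_queue` and all numeric values are hypothetical.

```python
def simulate_queue(g, chunk_size, latency_ticks, n_ticks):
    """Simulate the action queue size for a queue threshold g in [0, 1].

    Returns (sizes, idle_ticks): the queue size after each control tick and
    the number of ticks in which the robot had no action to execute.
    """
    queue = chunk_size            # start with one full chunk available
    arrival = None                # tick at which the in-flight chunk arrives
    executed_in_flight = 0        # actions executed since the request was sent
    idle_ticks, sizes = 0, []
    for tick in range(n_ticks):
        # 1) Receive the in-flight chunk if due; its first actions are stale.
        if arrival is not None and tick >= arrival:
            fresh = chunk_size - executed_in_flight
            queue = max(queue, fresh)
            arrival = None
        # 2) Send a new observation once the queue fraction drops to g or below.
        if arrival is None and queue / chunk_size <= g:
            arrival = tick + latency_ticks
            executed_in_flight = 0
        # 3) Execute one action if available, otherwise stay idle.
        if queue > 0:
            queue -= 1
            executed_in_flight += 1
        else:
            idle_ticks += 1
        sizes.append(queue)
    return sizes, idle_ticks


H_A, LATENCY = 10, 3  # hypothetical chunk size and latency (in control ticks)
_, idle_sequential = simulate_queue(0.0, H_A, LATENCY, n_ticks=100)
_, idle_threshold = simulate_queue(LATENCY / H_A, H_A, LATENCY, n_ticks=100)
sizes_eager, idle_eager = simulate_queue(1.0, H_A, LATENCY, n_ticks=100)
```

In this toy model the sequential limit accumulates idle ticks, a threshold equal to the latency-to-chunk ratio removes them, and the eager setting keeps the queue oscillating below its maximum size.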
-Figure 34 emphasizes the trade-off governed by $g$: small values result in idle periods, whereas $g\approx 1$ assumes a highly accurate model and pays a significant compute price. In practice, choosing $g\in(0,1)$ allows striking a balance between reactivity and resource budgets. If not for the aforementioned similarity filter, the $\text{RobotClient}$ would send observations for processing every $(1 - g) H_a \cdot \Delta t$ seconds, receiving a new chunk of actions every $(1 - g) H_a \cdot \Delta t + \mathbb E[\ell_S]$, on average. The presence of the observation similarity filter dilates this processing time, and serves the purpose of avoiding the robot stalling due to the queue being constantly integrated with an incoming, nearly identical, action chunk. In particular, Figure 34 shows a queue which is filled with incoming actions *unless* near-duplicate observations are filtered out from the processing pipeline. For clarity, the red arrow in Figure 34 highlights a timestep where the observation similarity mechanism is bypassed, forcing a (nearly identical) observation to be processed as the queue is empty. +Figure 34 emphasizes the trade-off governed by $g$: small values result in idle periods, whereas $g\approx 1$ assumes a highly accurate model and pays a significant compute price. In practice, choosing $g\in(0,1)$ allows striking a balance between reactivity and resource budgets. If not for the aforementioned similarity filter, the $\text{RobotClient}$ would send observations for processing every $(1 - g) H_a \cdot \Delta t$ seconds, receiving a new chunk of actions every $(1 - g) H_a \cdot \Delta t + \mathbb E[\ell_S]$, on average. The presence of the observation similarity filter dilates this processing time, and serves the purpose of avoiding the robot stalling due to the queue being constantly integrated with an incoming, nearly identical, action chunk.
In particular, Figure 34 shows a queue which is filled with incoming actions *unless* near-duplicate observations are filtered out from the processing pipeline. For clarity, the red arrow in Figure 34 highlights a timestep where the observation similarity mechanism is bypassed, forcing a (nearly identical) observation to be processed as the queue is empty. #### Code Example: Using Async Inference @@ -1076,13 +1049,13 @@ TL;DR Openly available large scale datasets and the development of stable, expre -The advent of large models trained on internet-scale datasets has drastically influenced fields like Computer Vision (CV) and Natural Language Processing (NLP), shifting the paradigm towards combining (1) an initial, task-agnostic large-scale pre-training stage and (2) a task-specific adjustment phase. The pre-training/adaptation paradigm has now largely replaced more classic approaches consisting of task-specific data collection, curation and model training in many subdomains within CV and NLP, motivated by the main drawback of limited scalability for *task-specific approaches*, traditionally labor-intensive. Factors including (1) the advancements in generalist models learned with self-supervision for perception @oquabDINOv2LearningRobust2024 or semantic understanding @devlinBERTPretrainingDeep2019 and (2) the popularization of collective efforts to aggregate large-scale openly available datasets @collaborationOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 are increasingly pushing the field of robot learning towards the pre-train-and-adapt paradigm. This shift taps into the long-standing challenge of developing generalist robot policies, and holds the promise to surpass traditionally siloed approaches to robotics problems and develop a *foundation robotics model*.
While Section [sec:learning-bc-single] introduced methods for learning *single-task policies* such as ACT or Diffusion Policy, in this section we present advancements in developing *generalist, multi-task, policies*, capable of performing a wide range of tasks across different environments and embodiments, and guided by unstructured instructions given via natural language. +The advent of large models trained on internet-scale datasets has drastically influenced fields like Computer Vision (CV) and Natural Language Processing (NLP), shifting the paradigm towards combining (1) an initial, task-agnostic large-scale pre-training stage and (2) a task-specific adjustment phase. The pre-training/adaptation paradigm has now largely replaced more classic approaches consisting of task-specific data collection, curation and model training in many subdomains within CV and NLP, motivated by the main drawback of limited scalability for *task-specific approaches*, traditionally labor-intensive. Factors including (1) the advancements in generalist models learned with self-supervision for perception @oquabDINOv2LearningRobust2024 or semantic understanding @devlinBERTPretrainingDeep2019 and (2) the popularization of collective efforts to aggregate large-scale openly available datasets @collaborationOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 are increasingly pushing the field of robot learning towards the pre-train-and-adapt paradigm. This shift taps into the long-standing challenge of developing generalist robot policies, and holds the promise to surpass traditionally siloed approaches to robotics problems and develop a *foundation robotics model*.
While Section [learning-bc-single] introduced methods for learning *single-task policies* such as ACT or Diffusion Policy, in this section we present advancements in developing *generalist, multi-task, policies*, capable of performing a wide range of tasks across different environments and embodiments, and guided by unstructured instructions given via natural language. 35). +The remarkable success of foundation models in NLP and CV is predicated on two core principles: architectural innovation and joint data-compute scaling. The transformer architecture proved instrumental in capturing long-range dependencies in sequential data such as text, and its stability and expressivity made it the *de facto* standard for modern large-scale models trained on internet-scale amounts of data. In stark contrast with popular NLP @raffelExploringLimitsTransfer2023 and CV @ImageNet_VSS09 general-purpose datasets, the field of robotics has historically developed around task-specific datasets which hinders scalability across problems, resulting in a concrete data deficit for general-purpose robot learning. Unlike the wealth of relatively readily available text and images on the internet, robotics data is intrinsically embodied--datasets collected for a manipulation robot typically differ entirely from locomotion datasets. Further, datasets consisting of expert demonstrations are (1) intrinsically expensive to collect (2) and notoriously heterogeneous--different human experts may perform the same task optimally yet in very different ways. In particular, since each expert trajectory is tied to a specific robot platform and the operating conditions of its environment and task, data heterogeneity has long posed a *methodological* challenge for scaling robotics datasets via aggregation. 
Beyond this, heterogeneity also raises *conceptual* issues: naively mixing data across embodiments can induce negative transfer, as control strategies developed in isolation for different robot systems in different environments may even conflict when combined. Thus, the high degree of fragmentation of robotics datasets and tasks has traditionally led to the development of *specialist* policies, trained on small, task-specific datasets, and which excel at their designated task but fail to generalize to new situations (Figure 35). -Motivated by the pursuit of generalist robot policies, the research community started investigating what and how to integrate from other domains within ML. Figure 36 shows a timeline of some of the most popular contributions attempting at developing generalist policies. Starting from BC-Zero, a latent variable model trained on 25K+ demonstrations, the field has now evolved into $\pi_0$, a transformer-based model trained on 10M+ demonstrations and exhibiting strong few-shot capabilities across tasks and embodiments. For starters, Robotics Transformer 1 (RT-1) @brohanRT1RoboticsTransformer2023 represented a significant step in the direction of developing a generalist robot policies over prior work including (1) BC-Zero @jangBCZZeroShotTask2022 and (2) Gato @reedGeneralistAgent2022, in that @brohanRT1RoboticsTransformer2023 uses a much larger and diverse set of training tasks compared to both BC-Zero and Gato. In particular, RT-1 uses a transformer architecture, and is trained on as many as 130k human-recorded trajectories collected over 13 robots in the span on 17 months. RT-1 learns to process a history of camera images and a natural language instruction, and feeds the resulting sequence of high-dimensional tokens to a transformer, trained using a *classification loss on a discretized actions space* consisting of 6 256 bins, each for each joint of a 6-dof robotic arm. 
+Motivated by the pursuit of generalist robot policies, the research community started investigating what to integrate from other domains within ML, and how. Figure 36 shows a timeline of some of the most popular contributions attempting to develop generalist policies. Starting from BC-Zero, a latent variable model trained on 25K+ demonstrations, the field has now evolved into $\pi_0$, a transformer-based model trained on 10M+ demonstrations and exhibiting strong few-shot capabilities across tasks and embodiments. For starters, Robotics Transformer 1 (RT-1) @brohanRT1RoboticsTransformer2023 represented a significant step towards generalist robot policies over prior work including (1) BC-Zero @jangBCZZeroShotTask2022 and (2) Gato @reedGeneralistAgent2022, in that @brohanRT1RoboticsTransformer2023 use a much larger and more diverse set of training tasks compared to both BC-Zero and Gato. In particular, RT-1 uses a transformer architecture, and is trained on as many as 130k human-recorded trajectories collected over 13 robots in the span of 17 months. RT-1 learns to process a history of camera images and a natural language instruction, and feeds the resulting sequence of high-dimensional tokens to a transformer, trained using a *classification loss on a discretized action space* consisting of 256 bins for each of the 6 joints of a 6-dof robotic arm. Perhaps motivated by the contemporary successes of the transformer architecture in both CV and NLP, the same group of authors investigated using a discrete output space to model--inherently continuous--quantities such as actions, leveraging (1) a more powerful architecture and (2) a scaled-up dataset. In RT-2, @brohanRT2VisionLanguageActionModels2023 propose inheriting internet-scale semantic knowledge from large-scale multi-modal datasets to learn a single, *unified model* for robotics control.
Such a model, termed *Vision-Language-Action* (VLA) in the original RT-2 paper, effectively casts robot control as a language modeling problem, and in particular as a Visual Question-Answering (VQA) task, whereby the output token space used to represent *string* tokens is shared with the *8-bit tokens* used to represent the 256 actuation levels of each joint of a 6-dof robot. In their work, @brohanRT2VisionLanguageActionModels2023 propose co-fine-tuning then-leading large-scale VLMs such as PaLI-X @chenPaLIXScalingMultilingual2023 or PaLM-E @driessPaLMEEmbodiedMultimodal2023 on a mix of web and robotics data, thus complementing VQA training with robotics-specific signal, and learning to directly output robot actions in a shared token space for visual and language inputs. Using large models trained on internet-scale data as backbones for VLAs allows models to tap into the rich semantic knowledge embedded in the VLM’s parameters, interpreting new commands and recognizing unseen objects by connecting them to concepts acquired during pre-training. For instance, @brohanRT2VisionLanguageActionModels2023 show that while RT-2 has never been explicitly trained to repurpose tools for a hammering task, it can still draw on its semantic understanding of images, so that when asked which object between (1) a piece of paper, (2) a pair of headphones or (3) a rock may be used instead of a hammer, it correctly answers (3). @@ -1114,13 +1087,13 @@ The success of large, proprietary models like RT-1 and RT-2, highlighted a growi src={ch5_trends} zoomable downloadable - id="fig:ch5-trends" + id="ch5-trends" layout="fixed" alt="Robot learning is undergoing a paradigmatic shift: centralized data collections (A, left) are increa..."
caption={'Robot learning is undergoing a paradigmatic shift: centralized data collections (A, left) are increasingly larger, often comprising Ms of demonstrations, and (A, right) decentralized approaches to data collection are also rising as an alternative for large scale data collection. (B) Generalist models are also becoming increasingly smaller and easier to run on limited hardware.'} /> -Figure 37 illustrates graphically the two most relevant trends in modern robot learning. As datasets collected via centralized, cross-institutions cooperation of increasing size are made available for the research community, decentralized datasets collected by individual researchers and practitioners have also gained traction recently, closing the gap with academic benchmarks thanks to community-contributed datasets. Further, models used across tasks and embodiments are also becoming much more compute-efficient, and as a result the models’ size has been consistently reducing over time, with consequent gains for autonomous robots in real-world, resource-constrained environments. +Figure 37 illustrates graphically the two most relevant trends in modern robot learning. As datasets collected via centralized, cross-institutions cooperation of increasing size are made available for the research community, decentralized datasets collected by individual researchers and practitioners have also gained traction recently, closing the gap with academic benchmarks thanks to community-contributed datasets. Further, models used across tasks and embodiments are also becoming much more compute-efficient, and as a result the models’ size has been consistently reducing over time, with consequent gains for autonomous robots in real-world, resource-constrained environments. 
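To make the discretized action space used by RT-1 and RT-2 more concrete, the following is a minimal sketch of mapping continuous joint commands to 256 integer tokens and back. Uniform binning, the `low`/`high` bounds and the function names are assumptions made for illustration; this is not the implementation used in either paper.

```python
import numpy as np

NUM_BINS = 256  # 8-bit tokens: 256 actuation levels per action dimension


def discretize(actions, low, high, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer tokens in [0, num_bins - 1]."""
    scaled = (np.clip(actions, low, high) - low) / (high - low)  # rescale to [0, 1]
    # The upper bound falls exactly on num_bins, so clamp it into the last bin.
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def undiscretize(tokens, low, high, num_bins=NUM_BINS):
    """Map integer tokens back to the center of their bin."""
    return low + (tokens + 0.5) * (high - low) / num_bins


# Hypothetical 6-dof joint command, with every joint normalized to [-1, 1].
joints = np.array([-1.0, -0.4, 0.0, 0.3, 0.75, 1.0])
tokens = discretize(joints, low=-1.0, high=1.0)
recovered = undiscretize(tokens, low=-1.0, high=1.0)
```

Under these assumptions the round trip loses at most half a bin width per joint, which is the price paid for turning regression over continuous actions into classification over tokens.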
### Modern VLAs @@ -1142,28 +1115,28 @@ $\pi_0$ @blackp0VisionLanguageActionFlow2024 introduce a VLA consisting of a Mo src={ch5_pi0} zoomable downloadable - id="fig:ch5-pi0" + id="ch5-pi0" layout="fixed" alt="The π 0 architecture, as in @blackp0VisionLanguageActionFlow2024. Vision and language tokens are rou..." caption={'The π 0 architecture, as in @blackp0VisionLanguageActionFlow2024. Vision and language tokens are routed to a VLM backbone which is prevented from attending robot proprioperceptive states and action tokens, which are instead routed to a smaller subset of weights within the architecture. The architecture is trained with Flow Matching on 10M+ trajectories from a mixture of closed and openly available datasets.'} /> -Concretely, $\pi_0$ is a unified transformer with two disjoint sets of weights $\phi, \theta$. A larger VLM backbone $p_\phi$ initialized from Gemma 2.6B processes multiple image frames obtained from multiple cameras points $[\{ I_t \}_{t=1}^n]$, as well as a language instruction $[\ell_t]$ used to describe the task considered. Concurrently, a 300M-parameter *action expert* based on a similar transformer architecture is used processes the robot proprioperceptive state $q_t$ and an action chunk $a_{t:t+H_a}$ (Figure 38). The different expert networks operate separately in processing the respective inputs and turning them into query, key and value matrices, and only share information between each other via self-attention layers. The outputs from the VLM backbone are disregarded, while the vector field regressed by the action expert is used to iteratively refine the action process. 
In particular, $\pi_0$uses a *blockwise causal attention mask* over tokens belonging to three separate blocks: (1) image and language tokens $\mathcal T_i$ obtained from $[\{ I_t \}_{t=1}^n, \ell_t]$, (2) proprioperceptive tokens $\mathcal T_q$ obtained from $q_t$, and (3) the action tokens $\mathcal T_a$ for items in the chunk $a^{\tau}_{t:t+H_a}$ at time $\tau$ in the flow-matching process. Notably, *within* each block the attention operations are bidirectional, while across blocks, future blocks are masked out. Formally, this corresponds to using the attention mask $\mathbf{A} = \bordermatrix{ \mathcal{T}_i \mathcal{T}_q \mathcal{T}_a \cr \mathcal{T}_i \mathbf{1} \mathbf{0} \mathbf{0} \cr \mathcal{T}_q \mathbf{1} \mathbf{1} \mathbf{0} \cr \mathcal{T}_a \mathbf{1} \mathbf{1} \mathbf{1} \cr }, \quad \mathbf{1}: \text{Bidirectional Attention}, \ \mathbf{0}: \text{Masked Attention}$ Note how *intra*-block directional attention allows tokens to communicate freely, while *inter*-block communication is mediated by the attention mask $\mathbf{A}$. *Blockwise causal masking* effectively prevents the pre-trained perception-language tokens from attending to robotics-tokens, likely out of distribution for VLM backbones traditionally trained on large corpora of internet, non-robotics, data. Crucially, because communication is obstructed between image-language tokens, proprioperceptive and action tokens, one can cache keys and values across denoising steps at runtime time, incuring in a reduced computational footprint and faster inference. +Concretely, $\pi_0$ is a unified transformer with two disjoint sets of weights $\phi, \theta$. A larger VLM backbone $p_\phi$ initialized from Gemma 2.6B processes multiple image frames obtained from multiple cameras points $[\{ I_t \}_{t=1}^n]$, as well as a language instruction $[\ell_t]$ used to describe the task considered. 
Concurrently, a 300M-parameter *action expert* based on a similar transformer architecture processes the robot proprioceptive state $q_t$ and an action chunk $a_{t:t+H_a}$ (Figure 38). The two expert networks process their respective inputs separately, turning them into query, key and value matrices, and only share information with each other via self-attention layers. The outputs from the VLM backbone are disregarded, while the vector field regressed by the action expert is used to iteratively refine the action chunk. In particular, $\pi_0$ uses a *blockwise causal attention mask* over tokens belonging to three separate blocks: (1) image and language tokens $\mathcal T_i$ obtained from $[\{ I_t \}_{t=1}^n, \ell_t]$, (2) proprioceptive tokens $\mathcal T_q$ obtained from $q_t$, and (3) the action tokens $\mathcal T_a$ for items in the chunk $a^{\tau}_{t:t+H_a}$ at time $\tau$ in the flow-matching process. Notably, *within* each block the attention operations are bidirectional, while across blocks, future blocks are masked out. Formally, this corresponds to using the attention mask $\mathbf{A} = \bordermatrix{ & \mathcal{T}_i & \mathcal{T}_q & \mathcal{T}_a \cr \mathcal{T}_i & \mathbf{1} & \mathbf{0} & \mathbf{0} \cr \mathcal{T}_q & \mathbf{1} & \mathbf{1} & \mathbf{0} \cr \mathcal{T}_a & \mathbf{1} & \mathbf{1} & \mathbf{1} \cr }, \quad \mathbf{1}: \text{Bidirectional Attention}, \ \mathbf{0}: \text{Masked Attention}$ Note how *intra*-block bidirectional attention allows tokens to communicate freely, while *inter*-block communication is mediated by the attention mask $\mathbf{A}$, where rows index query blocks and columns index key blocks. *Blockwise causal masking* effectively prevents the pre-trained perception-language tokens from attending to robotics tokens, which are likely out of distribution for VLM backbones traditionally trained on large corpora of internet, non-robotics data.
Crucially, because image-language and proprioceptive tokens never attend to action tokens, their keys and values do not change across denoising steps and can be cached at runtime, reducing the computational footprint and speeding up inference. In $\pi_0$, both the VLM backbone and the action expert are updated using a *flow matching* loss, and in particular by minimizing: $$ -`\mathcal{L}(\phi, \theta) = \mathbb{E}_{\tau, \epsilon, o_t, a_{t:t+H_a}}\Big[ \big\Vert v_\theta(\underbrace{\tau a_{t:t+H_a} + (1-\tau) \epsilon}_{\tilde a_{t:t+H_a}},\, o_t,\, \tau) - (\epsilon - a_{t:t+H_a}) \big\Vert^2 \Big],\\ \tau \sim \mathrm{Beta}_{[0,s]}(1.5,1), \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad o_t, a_{t:t+H_a} \sim \mathcal D \notag` +`\htmlId{pi0-loss}{\mathcal{L}(\phi, \theta) = \mathbb{E}_{\tau, \epsilon, o_t, a_{t:t+H_a}}\Big[ \big\Vert v_\theta(\underbrace{\tau a_{t:t+H_a} + (1-\tau) \epsilon}_{\tilde a_{t:t+H_a}},\, o_t,\, \tau) - (\epsilon - a_{t:t+H_a}) \big\Vert^2 \Big],\\ \tau \sim \mathrm{Beta}_{[0,s]}(1.5,1), \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad o_t, a_{t:t+H_a} \sim \mathcal D \notag}` $$ - Where the experts parametrized by the separate weights $\phi, \theta$ interact with each other via self-attention layers only, so that the action expert $v_\theta$ internal computations also depend on the VLM backbone’s parameters $\phi$. Importantly, @blackp0VisionLanguageActionFlow2024 minimize [eq:pi0-loss] over both the multimodal backbone and action expert parameters, thus updating the internal representations of the VLM using BC-specific gradients. In contrast, @driessKnowledgeInsulatingVisionLanguageAction2025 later show that failing to insulate the VLM knowledge from the flow matching gradients actually harms performance.
Inference is performed iteratively refining action chunks while numerically forward-integrating the vector field predicted by the action expert, + Where the experts parametrized by the separate weights $\phi, \theta$ interact with each other via self-attention layers only, so that the action expert $v_\theta$ internal computations also depend on the VLM backbone’s parameters $\phi$. Importantly, @blackp0VisionLanguageActionFlow2024 minimize [pi0-loss] over both the multimodal backbone and action expert parameters, thus updating the internal representations of the VLM using BC-specific gradients. In contrast, @driessKnowledgeInsulatingVisionLanguageAction2025 later show that failing to insulate the VLM knowledge from the flow matching gradients actually harms performance. Inference is performed iteratively refining action chunks while numerically forward-integrating the vector field predicted by the action expert, ``` math \begin{equation} a_{t:t+H_a}^{\tau + \delta} = a_{t:t+H_a}^{\tau } + \delta v_\theta(a_{t:t+H_a}^{\tau }, o_t) \end{equation} ``` -Flow matching  can be seen as a continuous time, detetrministic generalization of Diffusion and has proven effective in modeling highly complex multi-modal distributions, including those over images and video. In turn, its application to large-scale data collections of multiple human behaviors across tasks and embodiments appears rather consequential, particularly considering how it can enable faster inference via a reduced number of denoising steps--as few as 10, in $\pi_0$. In particular, the action expert is model as a conditional flow matching model. Each action token embeds a noisy action $a_i^{\tau} \in a^\tau_{t:t+H_a}$, alongside a sinusoidal encoding of the *flow process* timestep $\tau$. The action expert then leverages full bidirectional attention across the $H_a$ action tokens provided, as well as attends to previous proprioperceptive and image-language tokens as well. 
Interestingly, differently from a standard flow matching pipeline @lipmanFlowMatchingGenerative2023, $\tau$ is *not* sampled from a uniform distribution $\tau \sim \mathcal U([0,1])$, but rather obtained from $\tau \sim \textrm{Beta}(1.5,1)$ defined on the $[0,s], s<1$ support (Figure [fig:ch5-pi0-sampling-timesteps]). +Flow matching can be seen as a continuous-time, deterministic generalization of Diffusion and has proven effective in modeling highly complex multi-modal distributions, including those over images and video. In turn, its application to large-scale data collections of multiple human behaviors across tasks and embodiments appears a rather natural consequence, particularly considering how it can enable faster inference via a reduced number of denoising steps--as few as 10, in $\pi_0$. In particular, the action expert is modeled as a conditional flow matching model. Each action token embeds a noisy action $a_i^{\tau} \in a^\tau_{t:t+H_a}$, alongside a sinusoidal encoding of the *flow process* timestep $\tau$. The action expert then leverages full bidirectional attention across the $H_a$ action tokens provided, and also attends to the preceding proprioceptive and image-language tokens. Interestingly, unlike a standard flow matching pipeline @lipmanFlowMatchingGenerative2023, $\tau$ is *not* sampled from a uniform distribution $\tau \sim \mathcal U([0,1])$, but rather from $\tau \sim \textrm{Beta}(1.5,1)$ defined on the support $[0,s]$, with $s<1$ (Figure [ch5-pi0-sampling-timesteps]).
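As a rough illustration of the training and inference procedure described for $\pi_0$, the sketch below builds the noisy interpolant, a regression target, and a forward Euler integrator in NumPy. All function names, the default `s=0.999` and the way the Beta draw is rescaled to $[0, s]$ are assumptions of this sketch, and a known oracle field stands in for the action expert. The sketch regresses the time derivative of the interpolant, $a - \epsilon$, so that integrating forward from $\tau=0$ to $\tau=1$ moves noise towards actions; formulations using a target of the opposite sign pair it with integration in the opposite direction.

```python
import numpy as np


def sample_timesteps(rng, n, s=0.999):
    # Hypothetical sampler: Beta(1.5, 1) draws rescaled to the [0, s] support.
    return s * rng.beta(1.5, 1.0, size=n)


def flow_matching_pair(actions, noise, tau):
    """Noisy interpolant and the velocity transporting noise towards actions."""
    tau = tau.reshape(-1, 1, 1)  # broadcast over (H_a, action_dim)
    noisy = tau * actions + (1.0 - tau) * noise
    target = actions - noise  # d(noisy) / d(tau)
    return noisy, target


def flow_matching_loss(v_fn, actions, noise, tau):
    """Mean squared error between the predicted and the target vector field."""
    noisy, target = flow_matching_pair(actions, noise, tau)
    return float(np.mean((v_fn(noisy, tau) - target) ** 2))


def euler_integrate(v_fn, a0, num_steps=10):
    """Forward Euler: a^{tau + delta} = a^{tau} + delta * v(a^{tau}, tau)."""
    delta = 1.0 / num_steps
    a = a0.copy()
    for k in range(num_steps):
        tau = np.full(a.shape[0], k * delta)
        a = a + delta * v_fn(a, tau)
    return a


rng = np.random.default_rng(0)
actions = rng.normal(size=(4, 50, 6))  # batch of 4 chunks, H_a = 50, 6 action dims
noise = rng.normal(size=(4, 50, 6))
tau = sample_timesteps(rng, 4)


def oracle(noisy, tau):
    # Stand-in for the action expert: the exact field for this (actions, noise) pair.
    return actions - noise


loss = flow_matching_loss(oracle, actions, noise, tau)
denoised = euler_integrate(oracle, noise, num_steps=10)
```

With the oracle field the loss vanishes and ten Euler steps recover the action chunks up to floating-point error; a learned $v_\theta$ would only approximate this behavior.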
@@ -1192,19 +1165,19 @@ VLAs remain in an early stage of development and are not yet as mature or widely src={ch5_smolvla} zoomable downloadable - id="fig:ch5-smolvla" + id="ch5-smolvla" layout="fixed" alt="The SmolVLA architecture, as in @shukorSmolVLAVisionLanguageActionModel2025. SmolVLA is a compact Mo..." caption={'The SmolVLA architecture, as in @shukorSmolVLAVisionLanguageActionModel2025. SmolVLA is a compact MoE model trained with flow matching to denoise action chunks. Vision and language tokens are fed to a VLM backbone, and share information with the proprioperceptive and action tokens via the attention mechanism. The attention expert interleaves SA and CA layers for further conditioning on the visual features from the VLM backbone. SmolVLA skips computations and reduces the visual tokens, resulting in 6x less memory usage than π 0 .'} /> -While encouraging efforts like $\pi_0$ @blackp0VisionLanguageActionFlow2024 demonstrate the feasibility of open VLA systems, they remain (1) large and compute-intensive and (2) dependent on closed datasets collected via centralized efforts on costly robotic platforms, ultimately hindering accessibility. SmolVLA mitigates both these accessibility issues by (1) prioritizing a compact, compute-efficient VLA design and (2) targeting community-contributed datasets on accessible robotic platforms such as the SO-100 and SO-101 arms. Similarly to $\pi_0$, SmolVLA (Figure 39) employs a MoE architecture combining a pretrained VLM backbone with a dedicated action expert, and trains with flow matching. To ensure efficiency and accessibility, SmolVLA adopts SmolVLM-2 @marafiotiSmolVLMRedefiningSmall2025 as its VLM backbone, considering SmolVLM-2’s reduced size and capability to process multiple image inputs alongside text items. SmolVLM-2 uses SigLIP @zhaiSigmoidLossLanguage2023 as vision encoder, producing visual features for a SmolLM2 language decoder @allalSmolLM2WhenSmol2025. 
Further, SmolVLA adopts a smaller action expert consisting of $\sim$100M parameters and an interleaved stack of self and cross-attention layers. To improve efficiency, the action expert adopts a reduced embedding dimension compared to the VLM backbone, resulting in $d_{v_\theta} = 0.75 d_{\text{VLM}}$. @shukorSmolVLAVisionLanguageActionModel2025’s design choices thus result in a much smaller size model compared to $\pi_0$, consisting of around 450M parameters versus $\pi_0$’s 3.3B parameters. +While encouraging efforts like $\pi_0$ @blackp0VisionLanguageActionFlow2024 demonstrate the feasibility of open VLA systems, they remain (1) large and compute-intensive and (2) dependent on closed datasets collected via centralized efforts on costly robotic platforms, ultimately hindering accessibility. SmolVLA mitigates both these accessibility issues by (1) prioritizing a compact, compute-efficient VLA design and (2) targeting community-contributed datasets on accessible robotic platforms such as the SO-100 and SO-101 arms. Similarly to $\pi_0$, SmolVLA (Figure 39) employs a MoE architecture combining a pretrained VLM backbone with a dedicated action expert, and trains with flow matching. To ensure efficiency and accessibility, SmolVLA adopts SmolVLM-2 @marafiotiSmolVLMRedefiningSmall2025 as its VLM backbone, considering SmolVLM-2’s reduced size and capability to process multiple image inputs alongside text items. SmolVLM-2 uses SigLIP @zhaiSigmoidLossLanguage2023 as vision encoder, producing visual features for a SmolLM2 language decoder @allalSmolLM2WhenSmol2025. Further, SmolVLA adopts a smaller action expert consisting of $\sim$100M parameters and an interleaved stack of self and cross-attention layers. To improve efficiency, the action expert adopts a reduced embedding dimension compared to the VLM backbone, resulting in $d_{v_\theta} = 0.75 d_{\text{VLM}}$. 
@shukorSmolVLAVisionLanguageActionModel2025’s design choices thus result in a much smaller model compared to $\pi_0$, with around 450M parameters versus $\pi_0$’s 3.3B. Effectively, SmolVLA consumes multi-view RGB images, a natural-language instruction, and a projected sensorimotor state token as inputs, together with the noised *action chunk* $\tilde a_{t:t+H_a}$ that the action expert $v_\theta$ is trained to denoise. In particular, robot proprioceptive states are first projected into a token space shared with the VLM to match $d_{\text{VLM}}$, and subsequently projected into the expert’s token space. Similarly to $\pi_0$, SmolVLA adopts separate experts communicating exclusively through self-attention layers which, unlike in $\pi_0$, employ simple causal masking rather than blockwise causal masking, resulting in a lower triangular attention mask. In contrast with $\pi_0$, the action expert interleaves *cross-attention* (CA) and *self-attention* (SA) layers, a choice shown to yield higher success rates and smoother action chunks in practice. While in the expert’s SA layers action tokens are used to obtain queries, keys and values, CA layers use action tokens only as queries, and instead project visual, language and proprioceptive tokens into a shared action space to obtain keys and values. Notably, these keys and values can be cached as well, resulting in performance gains at inference time. -SmolVLA trims both token and layer compute. First, it *reduces visual tokens* via pixel shuffle to a fixed budget of 64 tokens per frame, foregoing tiling used during VLM pretraining for runtime efficiency. Second, it *skips upper VLM layers*: the action expert consumes features from the first $N$ decoder layers, with $N=L/2$ providing a good speed-performance trade-off and effectively halving downstream compute for the larger part of SmolVLA.
Beyond model compactness, SmolVLA also contributes an inference stack that decouples action prediction from execution for responsiveness on modest hardware (Section 4.4). +SmolVLA trims both token and layer compute. First, it *reduces visual tokens* via pixel shuffle to a fixed budget of 64 tokens per frame, forgoing the tiling used during VLM pretraining for the sake of runtime efficiency. Second, it *skips upper VLM layers*: the action expert consumes features from the first $N$ decoder layers, with $N=L/2$ providing a good speed-performance trade-off and effectively halving compute for the larger part of SmolVLA. Beyond model compactness, SmolVLA also contributes an inference stack that decouples action prediction from execution for responsiveness on modest hardware (Section 33). Departing from reliance on proprietary datasets, SmolVLA pretrains exclusively on 450+ *community datasets*, totaling 20K+ trajectories. Because instructions in community-contributed datasets can be noisy or missing, the authors re-annotate tasks with a small off-the-shelf VLM using frames sampled from the dataset, and standardize camera viewpoints by mapping sources to a consistent top/wrist/side ordering. At inference, similarly to $\pi_0$, SmolVLA integrates the flow over 10 steps, resulting in fast inference. SmolVLA proves effective across a range of both real-world and simulated environments, rivaling $\pi_0$ while being close to 40% faster and consuming 6x less memory.
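The difference between the blockwise causal mask of $\pi_0$ and the lower triangular mask adopted by SmolVLA can be sketched with boolean matrices, where rows index queries and columns index keys. The block sizes below (3 image-language tokens, 1 proprioceptive token, 2 action tokens) and the function names are hypothetical, chosen only to keep the example small; neither function is taken from an actual implementation.

```python
import numpy as np


def blockwise_causal_mask(block_sizes):
    """True where the query token (row) may attend to the key token (column)."""
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # A token attends to every token in its own block and in all earlier blocks.
    return block_ids[:, None] >= block_ids[None, :]


def causal_mask(num_tokens):
    """Lower triangular mask: each token attends to itself and to earlier tokens."""
    return np.tril(np.ones((num_tokens, num_tokens), dtype=bool))


# Hypothetical sequence: 3 image-language, 1 proprioceptive and 2 action tokens.
blockwise = blockwise_causal_mask([3, 1, 2])
causal = causal_mask(6)
```

Every entry allowed by the causal mask is also allowed by the blockwise one; the blockwise mask additionally lets tokens within the same block attend to each other in both directions.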